Re: st: collecting raw data from the web via browser automation
On May 22, 2006, at 10:54 PM, Michael Blasnik wrote:
> I'm not sure if any of these tools can actually solve the problem
> originally posted.
Yes, they can. Both curl and wget support authentication, cookies,
SSL, and the use of HTTP POST (in addition to GET) to submit a
request. And with either Python or Perl, you can script an entire
web session, including passing through multiple forms, with each
subsequent request dependent on the result(s) returned from the last.
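
To make the point about scripting a session concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not a tested recipe for any particular site: the function names, the form fields, and the next_url_from callback are all hypothetical, and a real site may need extra headers or different field names.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_opener_with_cookies():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def make_post_request(url, fields):
    """Build (but do not send) a POST request carrying form fields."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def run_session(opener, form_url, form_fields, next_url_from):
    """Submit one form, then fetch a second page chosen from the first reply.

    next_url_from is a caller-supplied (hypothetical) function that inspects
    the first response body and returns the URL to request next.
    """
    with opener.open(make_post_request(form_url, form_fields)) as resp:
        first_page = resp.read().decode("utf-8", errors="replace")
    # Cookies set by the first response are sent automatically here.
    with opener.open(next_url_from(first_page)) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Because the same opener is reused, any cookie set by the first response accompanies the second request, which is what lets later requests depend on earlier ones.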
As a later post indicated, you can use Stata's -copy- to retrieve a
page using GET (i.e., parameters encoded in the actual URL), and in
this way initiate a search with Google Scholar. However, the URL in
the original posting resulted from clicking on the "Import into..."
link corresponding to a single item from the list of items returned
by a search. I'm not sure how this selection would be made
programmatically, or whether the intention was to grab the information
on all of the top n items (note that, depending on how large n is,
these might be spread across multiple results pages, because results
are batched). Moreover, the format of the data returned by the
original URL depends upon how your "Scholar Preferences" are set
(i.e., which bibliographic format), and these preferences are
probably stored in a cookie. Finally, regardless of the export
format chosen, you may still need to do some post-processing before
reading the "data" into Stata. Thus, even though the initial search
can be triggered with -copy-, one of the other suggested tools may
well be necessary to complete the entire task (or at least to do so
in an efficient way).
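
As a rough sketch of the batching issue, the following Python builds the list of GET URLs needed to cover the top n results. Only the "q" parameter appears in the URL quoted in this thread; the search endpoint, the "start" offset parameter, and the page size of 10 are assumptions made for illustration and would need to be checked against the actual site.

```python
from urllib.parse import urlencode

BASE_URL = "http://scholar.google.com/scholar"  # assumed search endpoint
PAGE_SIZE = 10  # assumed number of results per page


def result_page_urls(query, n, page_size=PAGE_SIZE):
    """Return the GET URLs needed to cover the top n results."""
    urls = []
    for start in range(0, n, page_size):
        # "start" is an assumed name for the result-offset parameter.
        params = urlencode({"q": query, "start": start})
        urls.append(f"{BASE_URL}?{params}")
    return urls
```

Each URL in the list could then be retrieved in turn, whether with -copy-, curl, wget, or a script.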
On May 22, 2006, at 4:21 PM, Austin Nichols wrote:
> Google Scholar has a nice way to set Preferences so that links to
> bibliographic info are generated in the search results, but I don't
> use BibTeX or EndNote or any of those things--I use Stata, and I
> want to automate the whole process of searching and saving those
> data (which look like
> http://scholar.google.com/scholar.bib?q=info:nmXVGJVxYjQJ:scholar.google.com/&output=citation
> by the way) and infiling them into Stata so I can have a nice
> database of articles made for me on any set of search terms I put in.
I meant to comment on this before, but forgot. As much as I love to
see new uses for Stata, I would strongly urge you to look at one of
the free programs available for managing BibTeX files (e.g.,
tkbibtex, BibTool, or Bibcursed (multi-platform); BibDesk (my
personal favorite; OS X only); or BibEdit or BibDB (Windows only)).
Also, many text editors provide tools for working with files in
BibTeX format. As you know, you can export directly into BibTeX
format from Scholar. Even if you don't actually use BibTeX when
writing, these tools may permit you to accomplish what you need, and
may suggest other things you hadn't thought of.
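
If you do end up post-processing BibTeX output yourself, something like the following Python sketch may be a starting point. It assumes simple entries whose field values sit in one level of braces or in double quotes; nested braces, @string macros, and string concatenation are not handled. The function names are hypothetical.

```python
import csv
import io
import re

# "@type{key," at the start of an entry.
ENTRY_HEAD = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# name = {value} or name = "value" (no nested braces).
FIELD = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")')


def parse_entry(text):
    """Parse one simple BibTeX entry into a dict of lowercase field names."""
    head = ENTRY_HEAD.search(text)
    if head is None:
        raise ValueError("no BibTeX entry found")
    record = {"type": head.group(1).lower(), "key": head.group(2)}
    for name, braced, quoted in FIELD.findall(text[head.end():]):
        record[name.lower()] = braced or quoted
    return record


def to_tab_delimited(records, columns):
    """Render records as tab-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t",
        extrasaction="ignore", restval="", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

Written to a file, the tab-delimited text should be readable with -insheet-, although fields containing tabs or line breaks would need extra care.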
-- Phil
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/