st: collecting raw data from the web via browser automation
Austin said:
The trouble is this: the link to bibliographic data is not a static
page; it is generated on the fly, so Stata cannot -copy- it to a
local file for -infile- to read. I will need a browser to browse to
that location and then save the results. Does anyone have a freeware
solution to this problem? I have access to several varieties of
Windows and Unix/Linux, but no Mac OS options. What I am thinking is
that if there is a command-line browser with an option to save the
page to disk, I can fetch and save the page with a single line of
code that begins with -shell-, and then read the result in with
another line that begins with -infile-.
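In Stata terms, the plan might look like this; a minimal sketch,
assuming the text-mode browser lynx is installed on the path, and
using a made-up URL and file name:

* fetch the dynamically generated page; lynx -dump renders it as
* plain text on standard output, which is redirected to a file
shell lynx -dump "http://www.example.org/cgi-bin/biblio?id=123" > biblio.txt
* read each line of the saved page into one string variable
infix str line 1-244 using biblio.txt, clear

The infix column range (1-244) simply grabs the first 244 characters
of each line; parsing the actual fields would follow from there.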
One thing to remember: if you can do it in Unix/Linux, you can always
do it in Mac OS X, which is, after all, Unix with a Mac face.
On Mac OS X, either wget or curl will do what you want. For example:
curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html
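The wget equivalent names the output file with -O (a sketch, assuming
wget is installed):

wget -O austin.html http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html

Either command can be run from within Stata by prefixing it with
-shell-, after which -infile- (or -infix-) can read the saved file.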
Perl is an excellent tool for grabbing web pages and turning them
into text files (perhaps after stripping HTML tags). For examples,
see a number of the scripts I have written in RePEc under
software -> RePEc team (one, for instance, snarfs the AEA's XML data
for the A.E.R. and turns it into RePEc templates).
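As a minimal illustration of the tag-stripping step, a one-line Perl
filter can be run on the file fetched above (a crude regular
expression, not a real HTML parser; it assumes perl is installed):

* delete anything that looks like an HTML tag from every line
shell perl -pe 's/<[^>]*>//g' austin.html > austin.txt

Perl's -p flag loops over the input lines and prints each one after
the substitution has run; -e supplies the one-line program.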
Kit Baum, Boston College Economics
http://ideas.repec.org/e/pba1.html