Re: st: collecting raw data from the web via browser automation
On May 22, 2006, at 5:52 PM, Kit Baum wrote:
On Mac OS X, either wget or curl will do what you want. I.e.

    curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html

Perl is an excellent tool to grab web pages and turn them into text
files (perhaps after stripping html tags). See a number of the
scripts I have written in RePEc under software->RePEc team for
examples (one, for instance, snarfs the AEA's XML data for the
A.E.R. and turns it into RePEc templates).
To Kit's excellent answer, I would only add that Python is also a
great tool for screen scraping. In fact, what you are proposing is a
pretty common thing to do (I've done it occasionally myself, though
not with search results from Google Scholar). Note also that Perl
and Python (and probably wget or curl as well) come pre-installed
on many operating systems, and can easily be installed on Windows.
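
As a rough illustration of the Python route, here is a minimal sketch that fetches a page and reduces it to plain text, much like the curl command above followed by a tag-stripping pass. It assumes Python 3 and uses only the standard library (urllib.request and html.parser). The names TextExtractor, html_to_text, and fetch_text are made up for this example, and skipping script and style contents is a choice of this sketch rather than anything prescribed in the thread.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document and drop the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_text(url, timeout=30):
    """Download a page (as curl would) and return its text content."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Calling fetch_text() with the archive URL from the curl example and writing the result to a file would give a text version of that page, provided the URL still resolves. For anything beyond simple pages, a dedicated HTML parsing library is likely to be more robust than this sketch.
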
-- Phil
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/