st: collecting raw data from the web via browser automation
Austin said:
The trouble is this: the link to bibliographic data is not a static
page; it is generated on the fly, so Stata cannot -copy- it to a
local file for -infile- to read. I will need a browser to browse to
that location and then save the results. Does anyone have a freeware
solution to this problem? I have access to several varieties of
Windows and Unix/Linux, but no Mac OS options. What I am thinking is
that if there is a command-line browser with an option to save the
page to disk, I can fetch and save the page with a single line of
code that begins with -shell-, and then read the result in with
another line that begins with -infile-.
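In Stata terms, the plan might look like this; a minimal sketch,
assuming the text-mode browser lynx is installed on the path, and
using a made-up URL and file name:

* fetch the dynamically generated page; lynx -dump renders it as
* plain text on standard output, which is redirected to a file
shell lynx -dump "http://www.example.org/cgi-bin/biblio?id=123" > biblio.txt
* read each line of the saved page into one string variable
infix str line 1-244 using biblio.txt, clear

The infix column range (1-244) simply grabs the first 244 characters
of each line; parsing the actual fields would follow from there.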
One thing to remember: if you can do it in Unix/Linux, you can always
do it in Mac OS X, which is, after all, Unix with a Mac face.
On Mac OS X, either wget or curl will do what you want. For example:
curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html
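The wget equivalent names the output file with -O (a sketch, assuming
wget is installed):

wget -O austin.html http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html

Either command can be run from within Stata by prefixing it with
-shell-, after which -infile- (or -infix-) can read the saved file.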
Perl is an excellent tool for grabbing web pages and turning them
into text files (perhaps after stripping HTML tags). For examples,
see a number of the scripts I have written in RePEc under
software -> RePEc team (one, for instance, snarfs the AEA's XML data
for the A.E.R. and turns it into RePEc templates).
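As a minimal illustration of the tag-stripping step, a one-line Perl
filter can be run on the file fetched above (a crude regular
expression, not a real HTML parser; it assumes perl is installed):

* delete anything that looks like an HTML tag from every line
shell perl -pe 's/<[^>]*>//g' austin.html > austin.txt

Perl's -p flag loops over the input lines and prints each one after
the substitution has run; -e supplies the one-line program.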
Kit Baum, Boston College Economics
http://ideas.repec.org/e/pba1.html