Re: st: collecting raw data from the web via browser automation
From: "Michael Blasnik" <[email protected]>
To: <[email protected]>
Subject: Re: st: collecting raw data from the web via browser automation
Date: Mon, 22 May 2006 23:54:04 -0400
I'm not sure if any of these tools can actually solve the problem originally
posted.
The example Kit gives shows accessing a static web page -- a page that
already exists "as is" and one you could also simply copy to your local
drive using Stata itself (copy http:/.../...) and then parse it as needed.
It's easy to download that data directly to Stata and I don't think that is
the problem.
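For anyone working outside Stata, the same "copy the static page to a local file, then parse it" step can be sketched in Python. This is only an illustration under stated assumptions: the function names save_page and read_lines are made up for this example, and the URL in the usage note is a placeholder.

```python
import shutil
import urllib.request


def save_page(url, dest_path):
    """Download the resource at `url` and write its bytes to `dest_path`."""
    # urlopen handles http://, https:// and file:// URLs alike.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path


def read_lines(path, encoding="utf-8"):
    """Return the saved page as a list of lines, ready for parsing."""
    # errors="replace" keeps going if the page is not valid UTF-8.
    with open(path, encoding=encoding, errors="replace") as fh:
        return fh.read().splitlines()
```

Usage would look something like save_page("http://example.com/page.html", "page.html") followed by read_lines("page.html"). This only works when the page has a fixed address.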
I think what the original post asked for (and what I would be interested in
as well) is a way to access web pages that are only created when an action
is taken or selection is made on a different web page, so there is no
specific web address that holds the data you want. I have thought about
trying to use AutoIt or another scripting language to launch a browser,
make selections on a web page, and then capture the resulting data, which
typically appears in a new window.
Do any of the tools mentioned by Kit or Phil actually do this?
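One approach that sometimes works without driving a browser at all is to send the same request the browser would send. This assumes the page is produced by an ordinary HTML form submission rather than by JavaScript. A minimal sketch in Python follows. The function names, and the URL and field names in the usage note, are hypothetical; the real ones would have to be read from the form's HTML source.

```python
import urllib.parse
import urllib.request


def build_form_request(action_url, fields, method="POST"):
    """Build the HTTP request a browser would send when a form is submitted.

    Assumes `action_url` has no query string of its own.
    """
    encoded = urllib.parse.urlencode(fields)
    if method.upper() == "GET":
        # GET forms put the selections in the query string, so the result
        # page does have an address after all.
        return urllib.request.Request(action_url + "?" + encoded)
    # POST forms send the selections in the request body instead.
    return urllib.request.Request(
        action_url, data=encoded.encode("ascii"), method="POST"
    )


def submit_form(action_url, fields, method="POST"):
    """Send the form request and return the raw bytes of the result page."""
    request = build_form_request(action_url, fields, method)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

For example, submit_form("http://example.com/cgi-bin/query", {"state": "MA", "year": 2006}) would return the generated page as bytes. If the site depends on scripts running in the browser, this will not be enough and a real browser automation tool would still be needed.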
Michael Blasnik
[email protected]
----- Original Message -----
From: "Phil Schumm" <[email protected]>
To: <[email protected]>
Sent: Monday, May 22, 2006 10:31 PM
Subject: Re: st: collecting raw data from the web via browser automation
On May 22, 2006, at 5:52 PM, Kit Baum wrote:
On Mac OS X, either wget or curl will do what you want. For example:
curl http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0605/Date/article-780.html > austin.html
Perl is an excellent tool to grab web pages and turn them into text
files (perhaps after stripping html tags). See a number of the scripts I
have written in RePEc under software->RePEc team for examples (one, for
instance, snarfs the AEA's XML data for the A.E.R. and turns it into
RePEc templates).
To Kit's excellent answer, I would only add that Python is also a great
tool for screen scraping. In fact, what you are proposing is a pretty
common thing to do (I've done it occasionally myself, though not with
search results from Google Scholar). Note also that Perl and Python (and
probably also either wget or curl) come pre-installed under many OSes,
and can also be easily installed under Windows.
-- Phil
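As a rough illustration of the "strip the HTML tags and keep the text" step mentioned above, here is a minimal Python sketch using only the standard library's html.parser module. The class name TextExtractor and the function strip_tags are invented for this example. Real pages usually need more care, for instance keeping table rows apart.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only; script and style bodies are dropped.
        if not self._skip_depth:
            self.chunks.append(data)


def strip_tags(html_text):
    """Return the text of `html_text` with markup removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.chunks).split())
```

Calling strip_tags on the contents of a saved page gives a single whitespace-normalized string that can be written to a text file and read into Stata or any other package.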