Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: question : linking results of web queries to data
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: question : linking results of web queries to data
Date: Fri, 17 Dec 2010 15:50:56 -0500
Thank you, Eric, this is a great solution.
Much better than hoping to code up the right Perl script and call it.
Laszlo
On Fri, Dec 17, 2010 at 2:50 PM, Eric Booth <[email protected]> wrote:
>
> It looks like Laszlo wants to grab the "short id" from this query tool and merge it onto a list of names or other information already in his database. I'll leave the IRB issues that Peter mentions to Laszlo; I assume he's just compiling a database from this publicly available website (if so, another approach is simply to request the data from the website owner).
>
> I don't know how to drive this CGI search from Stata (and Stata may not be the best tool for this), but there are a few options for using Stata to get the elements you need from these webpages (since I don't know exactly what information you need, I can't say which of these is best):
>
> (1) if all you need is the short ids from this CGI query tool, you could just search 26 times, once for each letter "a", "b", ..., and then copy and paste the list of short ids to a local file
>
> (2) you can get the same list of all the short ids, linked to the authors' pages, from this listing, which saves you from using the CGI query tool repeatedly:
> http://ideas.repec.org/f/
>
>
> (3) you can get the list of all the authors' webpages (which includes their short id) from:
> http://ideas.repec.org/i/eall.html
>
> For this option, you can automate extracting the information you need from these pages by using -copy- to download the file to your machine and -intext- (from SSC) to read it into Stata:
>
>
> ***************!
> copy "http://ideas.repec.org/e/" "index.txt", replace public
>
> clear
> **you need -intext- from SSC**
> cap which intext
> if _rc ssc install intext, replace
>
> **be patient, this can take a while-->
> intext using "index.txt", g(v) length(100)
> split v1, p(`"href=""')
> split v12, p(`".html"')
>
> **v121 should contain the short IDs of interest**
> ds v121, not
> drop `r(varlist)'
> drop if mi(v121)
>
> **get rid of extra cells with html tags**
> foreach v in "<" ">" "/" {
>     cap drop if index(v121, "`v'")
> }
> **now you've got a list of all the short ids**
> levelsof v121, loc(shortid)
> foreach v in `shortid' {
>     copy "http://ideas.repec.org/e/`v'.html" "`v'.txt", replace public
>     *< use -intext- and -split- to get the fields you need and clean them up>*
> }
> ***************!
> I'll leave the last steps to you, but you should be able to follow the same process I used for the list of short ids to extract other fields from the authors' HTML pages (e.g., first name, last name, webpage, email, citations, affiliations). Use -split- and other string functions (see -help string_functions-) to clean up your records. Once each author's page is cleaned up, you can append them all together and then merge the appended file onto your main dataset by the short id.
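>
> As a rough illustration of that last append-and-merge step, here is a minimal sketch. Everything specific in it is an assumption rather than something taken from the site: the two short ids, the variable names (shortid, lastname, name), and the toy main dataset are made up, and the parsing of each author's page is left as a placeholder.
>
> ```stata
> * Sketch only: ids, variable names, and data below are hypothetical.
> clear
> input str20 shortid str30 name
> "paa1" "Author A"
> "pbb2" "Author B"
> end
> tempfile main
> save `main'
>
> * stands in for the result of -levelsof v121, loc(shortid)-
> local shortid "paa1 pbb2"
>
> * start an empty file that collects one row per author
> clear
> tempfile authors
> save `authors', emptyok
>
> foreach v of local shortid {
>     clear
>     set obs 1
>     gen str20 shortid = "`v'"
>     * placeholder: fill this from the fields parsed out of "`v'.txt"
>     gen str30 lastname = ""
>     append using `authors'
>     save `authors', replace
> }
>
> * attach the author records to the main data by short id
> use `main', clear
> merge 1:1 shortid using `authors'
> tab _merge
> ```
>
> With real data the main file may hold several rows per author, in which case -merge m:1- would be the form to use instead of -merge 1:1-.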
>
> - Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
> Office: +979.845.6754
>
>
>
>
>
> On Dec 17, 2010, at 10:50 AM, László Sándor wrote:
>
> > Hi all,
> >
> > I need to query a website for some extra data that I would link to my
> > existing one. I am using Stata 11.1 on Mac and Unix.
> >
> > My data has names of people, and I should query a site using CGI
> > (http://ideas.repec.org/cgi-bin/shortid.cgi) and collect a single
> > string from the resulting pages into a new variable.
> >
> > I don't know enough about Perl (etc.) to simply write the right
> > script, run it with -shell-, and get the data that way. I would
> > appreciate any guidance (tools, examples) on how this could be done; I
> > have not found this functionality in (or 'around') Stata so far.
> >
> > Thank you,
> >
> > Laszlo
> >
> >
> > László Sándor
> > PhD candidate in Economics
> > Harvard University
> >
> > *
> > * For searches and help try:
> > * http://www.stata.com/help.cgi?search
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/
>