I am doing something like that right now, except instead of html files
I am parsing a set of do-files saved at different points over the last
two years. I want to read each file line by line and write the lines
that start with a specific string to a text file next to the date when
that do-file was saved, to have a record of how this particular chunk
of code changed over time.
So this message has two parts: first I'll try to help out Sebastian.
Then I'll tell you where I ran aground.
1. I am doing this file reading and writing for the first time in
Stata (I normally use PHP for that, which I know is quite a
workaround, but that's another story). I find that the "file" section
of the Stata 10 manual (p.140 of the [P] book) has everything that
Sebastian needs. But one concrete suggestion would be this:
* handles for the input and output files
tempname fh_in fh_out
local myfilein "your html file here"
local myfileout "your text file here"
local linenum=0
* quote the file names so that paths containing spaces still work
file open `fh_in' using "`myfilein'", read
file open `fh_out' using "`myfileout'", write
file read `fh_in' line
while r(eof)==0 {
    local linenum=`linenum'+1
    * compound quotes (`" "') let the line itself contain double quotes;
    * the parentheses mark the part to keep; put your own tags around them
    if regexm(`"`macval(line)'"',"<your html tag of interest here>(.*)</tag over>") {
        * 1 = first parenthesized subexpression (see URL below)
        local myline=regexs(1)
        di `"`myline'"'
        * write the captured text followed by a newline
        file write `fh_out' `"`myline'"' _n
    }
    file read `fh_in' line
}
file close `fh_in'
file close `fh_out'
For details on subexpression numbers 0-9 see here:
http://www.stata.com/support/faqs/data/regex.html.
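To make the subexpression numbering concrete, here is a minimal sketch meant to be run from a do-file. The <title> tag and the example string are made up for illustration and are not taken from Sebastian's pages. As far as I understand it, regexs(0) returns the whole match and regexs(n) returns the text captured by the n-th pair of parentheses in the pattern.

```stata
* Hypothetical input line; the <title> tag is only an illustration
local line "<title>My page</title>"

* regexm() returns 1 when the pattern matches; the parentheses define
* subexpression 1
if regexm("`line'", "<title>(.*)</title>") {
    * subexpression 0: everything the pattern matched, tags included
    display regexs(0)
    * subexpression 1: only the text between the tags
    display regexs(1)
}
```

The first display should show the full tagged string and the second only "My page". A pattern with more pairs of parentheses would make regexs(2), regexs(3), and so on available in the same way.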
2. And here's what ails me. My do-files have some comment sections
like the one shown below:
*Education Classification
*------------------------
*1. College Grad+ … where Master >150 and Bachelor >150
*2. College … where Bachelor >105
If you move the cursor over the ...'s above, you will see that
these are not three separate periods. Each is a single special
character (an ellipsis). This trips up the file read command,
look:
… where Grade School >150" invalid name
r(198);
Does anybody know how to get Stata to run through such characters? I
remember a long time ago I had a similar problem with some
double-quotes that I cut and pasted into the do-file editor. They came
from MS Word and were pretty (like so: “ ”) and Stata snorted on them.
It wanted them plain (like so: " "). At the time I just made a mental
note to always use a text editor for code, and that was that. But what
can I do now?
Thank you,
Gabi
On Thu, Apr 3, 2008 at 12:20 AM, Sebastian Bauhoff wrote:
> Dear Statalisters,
>
> I need to download a large number of html files from the internet and parse their content. The structure of the html pages is always the same, and I need to extract only a small part that is identified within the html code. I would like to use Stata to download the files, extract the information I want, and save the result in a dataset. Any suggestions or pointers much appreciated.
>
> Thanks,
> Sebastian
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>