Re: st: Read HTML file with Stata
My previous message had some issues with line wrapping, so here is the
code again:
**********************
clear
local sf "`c(pwd)'/"
**get some text / information from any webpage**
copy http://www.stata.com/support/updates/stata9.html "`sf'test.txt", replace
**alternate** !curl -o "`sf'test.txt" --anyauth --ignore-content-length http://www.stata.com/support/updates/stata9.html
**clean it up a bit by filtering out the HTML tags I don't need**
tokenize <tr> </tr> <b> </b> <i> </i> <td> </td> </a> </th>
while "`1'" != "" {
filefilter "`sf'test.txt" "`sf'testnew.txt", /*
*/ from("`1'") to("") replace
filefilter "`sf'testnew.txt" "`sf'test.txt", /*
*/ from("\Q") to("%") replace
mac shift
}
/* NOTE
Assuming your data is very structured/consistent,
you could just -keep- the variables
that contain your target information (so,
here you could keep rows 194/217 )
If you want to find certain fields or data
indicated by some tag, use these steps
*/
**import the information**
intext using "`sf'test.txt", gen(html) length(90)
save "`sf'test.dta", replace
/*
For the stata9 webpage,
we could use # as the indicator
for the row/column headings
and we could use "<a href=" as an
indicator of the cell values
*/
findval "#", substr gen(flag_headings)
findval "<a href=", substr gen(flag_cells)
drop if flag_headings==0 & flag_cells==0
keep html1
**
**Finally, split html1 & create a -substr- to clean up**
gen colheadings = strpos(html1, "#CFCFCF;%>")
gen rowheadings = strpos(html1, "#EFEFEF;%>")
gen cells_win = strpos(html1, "/win/%>")
gen cells_mac = strpos(html1, "/mac/%>")
drop if colheadings==0 & rowheadings==0 & /*
*/ cells_win==0 & cells_mac==0
**
foreach v in colheadings rowheadings cells_win cells_mac {
gen str20 `v'2 = ""
}
replace colheadings2 = substr(html1, colheadings+10,.) /*
*/ if colheadings>0
replace rowheadings2 = substr(html1, rowheadings+10,.) /*
*/ if rowheadings>0
replace cells_win2 = substr(html1, cells_win+7,.) /*
*/ if cells_win>0
replace cells_mac2 = substr(html1, cells_mac+7,.) /*
*/ if cells_mac>0
**drop colheadings rowheadings cells_win cells_mac html1
list, noobs sep(1) div
browse
save "`sf'test.dta", replace
*******************************
Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
On Jul 13, 2009, at 11:26 PM, Eric A. Booth wrote:
Hello Friedrich:
It might help if we get a bit more detail on what your HTML file
looks like (e.g., is it a webpage with embedded tables of some sort;
does it just contain data that is delimited or broken into sections
with tags; etc.)--I think it would be difficult to create an add-on
or program that simply converts any HTML file to .XLS.
It's probably not the most elegant solution, but I was automating
the extraction of some text from a table embedded in a webpage by
using -filefilter- and some string functions to get rid of the HTML
tags, etc. -- leaving only the data I needed. I then used -intext-
to bring the information into Stata before cleaning it up some more.
For a rough example:
Say I wanted to extract the elements from the small table embedded
in this Stata webpage: http://www.stata.com/support/updates/stata9.html.
Given that there is a lot of extra information on the page, I would
need to look at the source code of the HTML web page and get rid of
the extra information & get at the elements in that small embedded
table. (Friedrich's issue may be much simpler in that his HTML
file might only be formatted data with a couple of HTML tags/strings
at the top & bottom of the file that can be filtered out using -filefilter-.)
Here's how I might extract the elements from this Stata web page
table into a .dta file (I would appreciate any other users'
comments on how to make this process easier or more efficient):
**********************
clear
local sf "`c(pwd)'/"
**get some text / information from any webpage using !curl (in linux/unix) or -copy-**
copy http://www.stata.com/support/updates/stata9.html "`sf'test.txt", replace
**alternate** !curl -o "`sf'test.txt" --anyauth --ignore-content-length http://www.stata.com/support/updates/stata9.html
**clean it up a bit by filtering out the HTML tags I don't need (you could put all possible HTML tags in this list)**
tokenize <tr> </tr> <b> </b> <i> </i> <td> </td> </a> </th>
while "`1'" != "" {
filefilter "`sf'test.txt" "`sf'testnew.txt", from("`1'") to("") replace
filefilter "`sf'testnew.txt" "`sf'test.txt", from("\Q") to("%") replace
/*I find that it's useful to get rid of the quotes at this point */
mac shift
}
intext using "`sf'test.txt", gen(html) length(90)
save "`sf'test.dta", replace
/* NOTE
Assuming your data is very structured/consistent,
at this point you could just -keep- the variables
that contain your target information (so,
here you could keep rows 194/217 )
If you want to find certain fields or data
indicated by some tag, use the steps below:
*/
**
/* NOTE
For this Stata9 web page table,
we could use # as the indicator for the row/column headings
and we could use "<a href=" as an indicator of the cell values
*/
findval "#", substr gen(flag_headings)
findval "<a href=", substr gen(flag_cells)
drop if flag_headings==0 & flag_cells==0
keep html1
**
**Finally, split html1 & create a -substr- to clean up**
gen colheadings = strpos(html1, "#CFCFCF;%>")
gen rowheadings = strpos(html1, "#EFEFEF;%>")
gen cells_win = strpos(html1, "/win/%>")
gen cells_mac = strpos(html1, "/mac/%>")
drop if colheadings==0 & rowheadings==0 & cells_win==0 & cells_mac==0
**
foreach v in colheadings rowheadings cells_win cells_mac {
gen str20 `v'2 = ""
}
replace colheadings2 = substr(html1, colheadings+10,.) if colheadings>0
replace rowheadings2 = substr(html1, rowheadings+10,.) if rowheadings>0
replace cells_win2 = substr(html1, cells_win+7,.) if cells_win>0
replace cells_mac2 = substr(html1, cells_mac+7,.) if cells_mac>0
**
drop colheadings rowheadings cells_win cells_mac html1
list, noobs sep(1) div
save "`sf'test_final.dta", replace
************************
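[For comparison only, not part of the original post: the same strip-the-tags-then-grab-what-follows-a-marker idea can be sketched outside Stata in a few lines of Python, using only the standard library. The tag list mirrors the -tokenize- list above; the sample HTML fragment is made up for illustration.]

```python
# Sketch of the same approach: strip a fixed list of literal HTML tags,
# swap double quotes for %, then take the substring after a marker.
TAGS = ["<tr>", "</tr>", "<b>", "</b>", "<i>", "</i>",
        "<td>", "</td>", "</a>", "</th>"]

def strip_tags(line):
    """Mirror the -filefilter- loop: remove each tag, then quotes -> %."""
    for tag in TAGS:
        line = line.replace(tag, "")
    return line.replace('"', "%")

def value_after(line, marker):
    """Mirror the strpos()/substr() step: return whatever follows the marker."""
    i = line.find(marker)
    return line[i + len(marker):] if i >= 0 else ""

# Toy fragment standing in for one line of the downloaded page.
raw = '<tr><td bgcolor="#CFCFCF;">Update</td></tr>'
cleaned = strip_tags(raw)                    # '<td bgcolor=%#CFCFCF;%>Update'
print(value_after(cleaned, "#CFCFCF;%>"))    # Update
```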
This leaves me with the following table, which you could then use to
pluck out the elements you needed for values in a table or variable/
value labels elsewhere:
+----------------------------------------------------------------------------+
| colhea~2 | rowheadings2      | cells_win2          | cells_mac2            |
|----------+-------------------+---------------------+-----------------------|
| Update   |                   |                     |                       |
|----------+-------------------+---------------------+-----------------------|
| Platform |                   |                     |                       |
|----------+-------------------+---------------------+-----------------------|
|          | Stata executables |                     |                       |
|----------+-------------------+---------------------+-----------------------|
|          |                   | Stata 9 for Windows |                       |
|----------+-------------------+---------------------+-----------------------|
|          |                   |                     | Stata 9 for Macintosh |
|----------+-------------------+---------------------+-----------------------|
|          | Stata ado-files   |                     |                       |
+----------------------------------------------------------------------------+
One of the many advantages of using !curl (or !url2file if you've
got Windows), in lieu of Stata's -copy-, is that you can send a pre-
filled form to the website by utilizing the "--form <name=content>"
option. So, if you are downloading all the tables generated by a
site that requests that you fill out a form or select some form
attributes before you get each file, you could send the information
that should go into those fields using !curl & automatically
download the data (assuming there is no CAPTCHA).
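[A rough illustration of the same form-posting idea, not from the original post: Python's standard library can send pre-filled form fields as a POST request. The URL and field names below are hypothetical; note that curl's --form sends multipart/form-data, while urlencode sends application/x-www-form-urlencoded, which many sites also accept.]

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical form fields and URL -- substitute the site's real ones.
fields = {"year": "2009", "platform": "win"}
body = urlencode(fields).encode("ascii")     # b'year=2009&platform=win'

# Passing data= makes this a POST request, roughly analogous to
# curl --form (though curl would send multipart/form-data instead).
req = Request("http://www.example.com/download", data=body)
# with urlopen(req) as resp:                 # uncomment to actually fetch
#     open("test.txt", "wb").write(resp.read())
print(body.decode("ascii"))                  # year=2009&platform=win
```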
Again, I do a lot of this type of work, so I too would be interested
in any suggestions or other approaches to solving Friedrich's HTML
conversion problem.
Best,
Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
On Jul 13, 2009, at 5:05 PM, Friedrich Huebler wrote:
I have a set of Excel files that I convert to Stata format with
Stat/Transfer, called from Stata with -stcmd- by Roger Newson. Some of
the original files are HTML files with an XLS extension that cannot be
converted by Stat/Transfer. I can open these files with Excel and save
them in native Excel format but would prefer a solution that does not
involve Excel. Can anyone recommend a method to read HTML files into
Stata? There are a number of add-ons that allow export to HTML format
but I found nothing that goes the other way, from HTML to Stata.
Thanks,
Friedrich
*
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/