Re: st: Download and parse html files (and regex trouble)
I am often unable to copy and paste code from the list and have it
run as the author intended. Opening such text in BBEdit or another text
editor that can show text gremlins often reveals invisible characters
that are the problem.
-Dave
On Apr 3, 2008, at 7:23 AM, Austin Nichols wrote:
Gabi Huiber <[email protected]> and Sebastian Bauhoff <[email protected]>:
I suspect Gabi has some extended ASCII characters in there causing the
trouble that one might not even be able to see, and will not appear in
plain text email to the list. Try analyzing the file with
    hexdump `file', analyze results
and then use -filefilter- to remove or replace potentially problematic
characters before further processing.
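The -hexdump- and -filefilter- steps suggested above might look roughly like the sketch below. It is only an illustration under stated assumptions: the throwaway file and the choice of byte 133 are hypothetical, picked so the example runs on its own, and the actual offending bytes in a given file would have to be read off the -hexdump- output.

```stata
* Sketch only: build a throwaway file holding one byte with code 133
* (a hypothetical stand-in for the file that misbehaves).
tempfile raw clean
tempname fh
file open `fh' using "`raw'", write text replace
file write `fh' "abc" _char(133) "def" _n
file close `fh'

* Summarize the file; -results- also stores per-code counts
* r(c0) through r(c255).
hexdump "`raw'", analyze results
display "format: `r(format)'"
display "bytes with code 133: " r(c133)

* Replace every byte 133 with three plain dots, writing a new file.
filefilter "`raw'" "`clean'", from("\133d") to("...") replace
local nfixed = r(occurrences)
display "replacements made: `nfixed'"
```

-filefilter- needs distinct input and output files, so the cleaned copy goes to a second file, which is then the one to pass on to later steps.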
For Sebastian's problem, he may be able to do something like this:
    copy http://whatever.com/some.html i.html, replace
    insheet using i.html
    g firstline=substr(v1,1,9)=="something"
    g lastline=substr(v1,1,14)=="something else"
    g keep=sum(firstline)-sum(lastline)
    list if keep
but -insheet- will also choke on extended ASCII characters, so
-hexdump- and -filefilter- may be required first.
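To see how the running-sum flag in the code above selects lines, here is a small self-contained sketch. The toy data and the BEGIN/END markers are made up for illustration and do not come from any real html page.

```stata
* Toy stand-in for the lines of a downloaded page (hypothetical content).
clear
input str20 v1
"header"
"BEGIN block"
"row one"
"row two"
"END block"
"footer"
end

* Flag the line that opens the block and the line that closes it.
generate byte firstline = substr(v1, 1, 5) == "BEGIN"
generate byte lastline  = substr(v1, 1, 3) == "END"

* Running sums: keep is 1 from the opening line up to, but not
* including, the closing line, and 0 everywhere else.
generate byte keep = sum(firstline) - sum(lastline)
list v1 if keep
```

With this construction the opening line is listed and the closing line is not. If the closing line is wanted as well, -list if keep | lastline- would include it, and -list if keep & !firstline- would drop the opening line.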
On Thu, Apr 3, 2008 at 3:06 AM, Gabi Huiber <[email protected]> wrote:
In an earlier response of mine to this post I blamed the ...
(dot-dot-dot) special character for breaking my file read code. That
was not the reason.
The command file read `fh' line chokes on do-file lines where a
comment is inserted before the end of the line with the double
forward slash syntax. I have no idea how to make that go away. I tried
enclosing my file read/file write routine within this if-condition:
if !regexm("macval(`line')","[[a-zA-Z0-9][:punct:]]*\/\/"){
read line in this file
write line in that file
}
But that had no effect.
Gabi
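One possible way to write such a filter, not confirmed anywhere in this thread as the fix for Gabi's problem, is to test each line with -strpos()- instead of a regular expression and to wrap the macro in compound double quotes with macval(), so that the line's contents are treated as literal text. The sample file and handle names below are hypothetical, and the two-slash string is assembled from two pieces only to keep the sketch itself free of that sequence.

```stata
* Sketch only: copy a file line by line, skipping lines that contain
* two consecutive forward slashes.
tempfile src dst
tempname hin hout
local slashes = "/" + "/"

* Build a three-line sample file; the middle line carries a comment.
file open `hout' using "`src'", write text replace
file write `hout' "display 1" _n
file write `hout' "display 2 /" "/ trailing comment" _n
file write `hout' "display 3" _n
file close `hout'

file open `hin'  using "`src'", read text
file open `hout' using "`dst'", write text replace
local kept 0
file read `hin' line
while r(eof) == 0 {
    * macval() stops macro expansion inside the line, and the compound
    * quotes tolerate ordinary double quotes in the line.
    if strpos(`"`macval(line)'"', "`slashes'") == 0 {
        file write `hout' `"`macval(line)'"' _n
        local ++kept
    }
    file read `hin' line
}
file close `hin'
file close `hout'
display "lines kept: `kept'"
```

This drops the whole line rather than only the comment, which matches the intent of the if-condition above; stripping just the comment would need -substr()- on the position that -strpos()- returns.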
On Thu, Apr 3, 2008 at 12:20 AM, Sebastian Bauhoff <[email protected]> wrote:
Dear Statalisters,
I need to download a large number of html files from the internet
and parse their content. The structure of the html pages is always the
same, and I need to extract only a small part that is identified
within the html code.
I would like to use Stata to download the files, extract the
information I want, and save the result in a dataset. Any suggestions
or pointers much appreciated.
Thanks,
Sebastian
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
--
David C. Airey, Ph.D.
Pharmacology Research Assistant Professor
Center for Human Genetics Research Member
Department of Pharmacology
School of Medicine
Vanderbilt University
Rm 8158A Bldg MR3
465 21st Avenue South
Nashville, TN 37232-8548
TEL (615) 936-1510
FAX (615) 936-3747
EMAIL [email protected]
URL http://people.vanderbilt.edu/~david.c.airey/dca_cv.pdf
URL http://www.vanderbilt.edu/pharmacology