Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: reading HTML source in Chinese but get a messy code
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: reading HTML source in Chinese but get a messy code
Date: Thu, 6 Jun 2013 14:36:48 -0400
Chuntao,
Adding to Nick's comments: you don't have to parse the HTML code
yourself, as this is a pretty standard task. For your purposes, the
following should yield a fairly clean file:
http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
where you supply your URL as a parameter.
Best, Sergiy Radyakin
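
Because the service takes the page address as a URL-encoded url= parameter, the converted text could be fetched from within Stata with -copy-. This is only a minimal sketch: it assumes the html2text service at the address above is reachable and still accepts that parameter, and the output file name qq_clean.txt is hypothetical. The result may still need an encoding conversion before the Chinese characters display correctly.

```stata
* Fetch the plain-text rendering produced by the html2text service.
* Assumption: the service is reachable and honours the url= parameter.
* qq_clean.txt is a hypothetical output file name.
copy "http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html" "qq_clean.txt", replace

* Display the downloaded file to check the result
type "qq_clean.txt"
```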
On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <[email protected]> wrote:
> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>
> A more fundamental point is that this is HTML:
>
> 1. So, lines will necessarily include HTML markup code in many if not
> all lines. You will need to strip those too, or interpret them.
>
> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>
> In this particular case, there are many references to yet other files,
> perhaps not of concern to you.
>
> I can't read Chinese, so that is as far as I go.
>
> Nick
> [email protected]
>
>
> On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
>> Dear Listers,
>>
>> I want to import the following HTML source file:
>>
>> http://qq.ico.la/qq459322466.html
>>
>> The source file contains some information in Chinese, which is
>> located in lines 32 to 73.
>>
>> I tried to import the information using the following code:
>>
>> clear all
>> set obs 500
>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>
>> mata:
>> fh = fopen("d:\qq.txt", "r")
>> for(i=1; i<=34; i++) {
>> junk=fget(fh)
>> }
>> for(i=1; i<=20; i++) {
>> junk=fget(fh)
>> junk
>> }
>>
>> end
>>
>> but the result in memory is only garbled text.
>>
>> Similar code has worked for another webpage, thanks to Prof. Kit
>> Baum, as shown below:
>>
>> clear all
>> set obs 500
>> local stkcd="000002"
>> gen str20 date="2012.12.31"
>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>> mata:
>> fh = fopen("d:\date.txt", "r")
>> for(i=1; i<=444; i++) {
>> junk=fget(fh)
>> }
>>
>> Can someone familiar with Chinese encoding give me some hints?
>>
>> Best
>>
>> Chuntao
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
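
The two problems raised in this thread, garbled Chinese text and leftover HTML markup, could be handled line by line in Mata roughly as follows. This is a hedged sketch, not a tested solution. It assumes Stata 14 or later (for the Unicode functions) and that the page is encoded in GB18030/GBK, which the thread does not confirm. The file name qq.txt is hypothetical, and the line range 32 to 73 is taken from the original question.

```stata
* Assumptions: Stata 14+; page encoded in GB18030/GBK (not confirmed);
* qq.txt is a hypothetical local file name.
copy "http://qq.ico.la/qq459322466.html" "qq.txt", replace

mata:
fh = fopen("qq.txt", "r")
i = 0
// fget() returns J(0,0,"") at end of file
while ((line = fget(fh)) != J(0, 0, "")) {
    i++
    if (i < 32 | i > 73) continue             // keep only lines 32 to 73
    line = ustrfrom(line, "gb18030", 1)       // convert to UTF-8; invalid bytes become U+FFFD
    line = ustrregexra(line, "<[^>]*>", "")   // drop HTML tags that open and close on this line
    if (strtrim(line) != "") printf("%s\n", line)
}
fclose(fh)
end
```

Stripping tags one line at a time will miss any tag that spans several lines, which is the caveat in Nick's second point. If the output is still unreadable, a different encoding name in ustrfrom() may be needed; -unicode encoding list- shows the names Stata accepts.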