Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: reading HTML source in Chinese but get a messy code


From   "Li Chuntao (Tony)" <[email protected]>
To   [email protected]
Subject   Re: st: reading HTML source in Chinese but get a messy code
Date   Sat, 8 Jun 2013 22:31:09 +0800

Thank you, Sergiy and Nick. You helped me out!

Chuntao

On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <[email protected]> wrote:
> Chuntao,
>
> adding to Nick's comments, you don't have to parse HTML code yourself
> as this is a pretty standard task. For your purposes the following
> should yield a pretty clean file:
> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>
> where you supply your URL as a parameter.
>
> Best, Sergiy Radyakin
>
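
Sergiy's example link passes the page address as a percent-encoded `url` parameter. A minimal sketch of building such a link is below, in Python rather than Stata so that it runs on its own. The helper name `html2text_url` is hypothetical, and whether the html2text.theinfo.org service is still reachable is not something this thread confirms.

```python
from urllib.parse import quote


def html2text_url(page_url, service="http://html2text.theinfo.org/"):
    """Build a converter link with page_url as the url= query parameter.

    The service address is copied from the message above; the function
    name is only an illustration.
    """
    # safe="" makes quote() percent-encode ":" and "/" as well,
    # which is what the example link in the message does.
    return service + "?url=" + quote(page_url, safe="")


print(html2text_url("http://qq.ico.la/qq459322466.html"))
```

For the address in this thread, the result matches the link Sergiy posted.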
>
> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <[email protected]> wrote:
>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>
>> A more fundamental point is that this is HTML:
>>
>> 1. So, lines will necessarily include HTML markup code in many if not
>> all lines. You will need to strip those too, or interpret them.
>>
>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>
>> In this particular case, there are many references to yet other files,
>> perhaps not of concern to you.
>>
>> I can't read Chinese, so that is as far as I go.
>>
>> Nick
>> [email protected]
>>
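
Nick's two points (markup sits inside most lines, and a tag may only make sense in light of earlier lines) suggest parsing the whole document rather than skipping a fixed number of lines. A rough sketch of that idea using Python's standard `html.parser` follows. The class name `TextExtractor`, the helper `html_to_lines`, and the sample markup are illustrative assumptions, not anything posted in this thread.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore tags, scripts, and style sheets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []   # pieces of visible text, in document order
        self._skip = 0     # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_lines(html):
    """Return the visible text fragments of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample markup, not the page discussed in this thread.
sample = ("<html><head><style>p{color:red}</style></head>"
          "<body><p>Name: <b>Tony</b></p>"
          "<script>var x=1;</script></body></html>")
print(html_to_lines(sample))
```

Because the parser keeps track of open tags across the whole input, it does not depend on where line breaks fall in the source file.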
>>
>> On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
>>> Dear Listers,
>>>
>>>        I want to import the following HTML source file:
>>>
>>>         http://qq.ico.la/qq459322466.html
>>>
>>>         The source file contains some information in Chinese, which is
>>> located in lines 32 to 73.
>>>
>>>          I tried to import the information using the following code:
>>>
>>> clear all
>>> set obs 500
>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>
>>> mata:
>>>         fh = fopen("d:\qq.txt", "r")
>>>         for(i=1; i<=34; i++) {
>>>         junk=fget(fh)
>>>         }
>>>         for(i=1; i<=20; i++) {
>>>         junk=fget(fh)
>>>         junk
>>>         }
>>>
>>> end
>>>
>>> but the resulting data in memory is just a mess.
>>>
>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>> Baum, as shown below:
>>>
>>> clear all
>>> set obs 500
>>> local stkcd="000002"
>>> gen str20 date="2012.12.31"
>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>> mata:
>>>         fh = fopen("d:\date.txt", "r")
>>>         for(i=1; i<=444; i++) {
>>>         junk=fget(fh)
>>>         }
>>>
>>> Can someone familiar with Chinese encoding give me some hints?
>>>
>>> Best
>>>
>>> Chuntao
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>> *   http://www.ats.ucla.edu/stat/stata/
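
One common cause of garbled Chinese text is reading the bytes of a page under the wrong character encoding. The thread does not say which encoding the page uses, so the sketch below simply assumes a GB-family encoding (GB2312 or GBK, both covered by the `gb18030` codec) to show the effect. It is written in Python, and the function name `decode_page` and the sample string are made up for illustration.

```python
def decode_page(raw, encoding="gb18030"):
    """Decode raw page bytes into text.

    ASSUMPTION: the page is in a GB-family encoding; the thread
    does not confirm this. errors="replace" substitutes a marker for
    any bytes that are invalid in that encoding instead of raising.
    """
    return raw.decode(encoding, errors="replace")


# Simulate a downloaded page: Chinese text stored as GBK bytes.
text = "中文信息"
raw = text.encode("gbk")

# Decoding with the matching codec recovers the text.
print(decode_page(raw) == text)

# Decoding the same bytes as Latin-1 produces unreadable characters.
garbled = raw.decode("latin-1")
print(garbled == text)
```

If the real page declares a different charset in its header or `<meta>` tag, that name would replace `gb18030` here.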