Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: reading HTML source in Chinese but get a messy code
From: "Li Chuntao (Tony)" <[email protected]>
To: [email protected]
Subject: Re: st: reading HTML source in Chinese but get a messy code
Date: Sat, 8 Jun 2013 23:04:49 +0800
Yes, of course I understand my own code. Here I just wanted to display
the first two lines to show that the output is garbled, and to ask for
help.
Thank you, Nick, for your always kind help.
Tony
On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <[email protected]> wrote:
> Your own code doesn't seem well matched to the input. In your first
> post you were looping over the lines of the file, reading them one by
> one and then processing them. You have abandoned that here. Do you
> understand what the original Mata code does?
> Nick
> [email protected]
>
>
> On 8 June 2013 15:51, Li Chuntao (Tony) <[email protected]> wrote:
>> Well, it still does not work, as can be seen from the output of the
>> following code.
>>
>> I mean, the output from http://html2text.theinfo.org seems quite
>> clean, but it turns into a mess when I try to read it into Stata,
>> whether by -insheet using- or by the Mata code that follows.
>>
>> Does anyone have experience with this?
>>
>>
>> thanks
>>
>> Chuntao
>>
>>
>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" "d:\temp.txt", replace
>> mata:
>> // open the downloaded text file for reading
>> fh = fopen("d:\temp.txt", "r")
>> // read and display the first two lines
>> junk=fget(fh)
>> junk
>> junk=fget(fh)
>> junk
>> fclose(fh)
>> end
>>
>>
>>
>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <[email protected]> wrote:
>>> Chuntao,
>>>
>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>> as this is a pretty standard task. For your purposes the following
>>> should yield a pretty clean file:
>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>
>>> where you supply your URL as a parameter.
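A minimal Mata sketch of how such a request URL could be built from a page address. The function name html2text_url() is hypothetical, and only ":" and "/" are percent-encoded, which is enough for the URLs quoted in this thread but not for arbitrary ones. Whether the service is still reachable is not something the thread can confirm.

```stata
mata:
// Hypothetical helper: wrap a page URL in the html2text query string.
// ASSUMPTION: only ":" and "/" need percent-encoding for these URLs.
string scalar html2text_url(string scalar page)
{
    string scalar enc

    enc = subinstr(page, ":", "%3A")   // ":" becomes %3A
    enc = subinstr(enc, "/", "%2F")    // "/" becomes %2F
    return("http://html2text.theinfo.org/?url=" + enc)
}
end
```

In a do-file, the result could then be passed to -copy- through a local macro, for instance one filled with st_local().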
>>>
>>> Best, Sergiy Radyakin
>>>
>>>
>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <[email protected]> wrote:
>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>
>>>> A more fundamental point is that this is HTML:
>>>>
>>>> 1. So, lines will necessarily include HTML markup code in many if not
>>>> all lines. You will need to strip those too, or interpret them.
>>>>
>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>
>>>> In this particular case, there are many references to yet other files,
>>>> perhaps not of concern to you.
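The two points above can be sketched in Mata as follows. This is only an illustration under stated assumptions: strip_tags() and show_text() are hypothetical names, tags are assumed to open and close on the same line (point 2 explains why that will not always hold), and nothing here addresses the character encoding of the text.

```stata
mata:
// Hypothetical helper: remove everything between "<" and ">" in one line.
// Tags that span several lines are NOT handled.
string scalar strip_tags(string scalar s)
{
    real scalar   a, b
    string scalar t, rest

    t = s                                  // work on a copy, not the caller's string
    while ((a = strpos(t, "<")) > 0) {
        rest = substr(t, a + 1, .)
        b = strpos(rest, ">")
        if (b == 0) break                  // unclosed "<": leave the rest alone
        t = substr(t, 1, a - 1) + substr(rest, b + 1, .)
    }
    return(strtrim(t))
}

// Hypothetical helper: read a file line by line and show the non-empty text.
void show_text(string scalar filename)
{
    real scalar   fh
    string matrix line                     // fget() returns J(0,0,"") at end of file

    fh = fopen(filename, "r")
    while ((line = fget(fh)) != J(0, 0, "")) {
        line = strip_tags(line)
        if (line != "") printf("%s\n", line)
    }
    fclose(fh)
}
end
```

With the file from the original post this would be called as show_text("d:\qq.txt").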
>>>>
>>>> I can't read Chinese, so that is as far as I go.
>>>>
>>>> Nick
>>>> [email protected]
>>>>
>>>>
>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
>>>>> Dear Listers,
>>>>>
>>>>> I want to import the following HTML source file:
>>>>>
>>>>> http://qq.ico.la/qq459322466.html
>>>>>
>>>>> The source file contains some information in Chinese, which is
>>>>> located in lines 32 to 73.
>>>>>
>>>>> I tried to import the information by using the following code:
>>>>>
>>>>> clear all
>>>>> set obs 500
>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>
>>>>> mata:
>>>>> fh = fopen("d:\qq.txt", "r")
>>>>> // skip the leading lines
>>>>> for(i=1; i<=34; i++) {
>>>>> junk=fget(fh)
>>>>> }
>>>>> // read and display the next 20 lines
>>>>> for(i=1; i<=20; i++) {
>>>>> junk=fget(fh)
>>>>> junk
>>>>> }
>>>>> fclose(fh)
>>>>> end
>>>>>
>>>>> but the result in memory is only a mess.
>>>>>
>>>>> Similar code has been used for another web page, thanks to Prof. Kit
>>>>> Baum, as can be seen below:
>>>>>
>>>>> clear all
>>>>> set obs 500
>>>>> local stkcd="000002"
>>>>> gen str20 date="2012.12.31"
>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" "d:\date.txt", replace
>>>>> mata:
>>>>> fh = fopen("d:\date.txt", "r")
>>>>> for(i=1; i<=444; i++) {
>>>>> junk=fget(fh)
>>>>> }
>>>>>
>>>>> Can someone familiar with Chinese encoding give me some hints?
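One possible direction on the encoding question, sketched in Mata. It rests on two assumptions that the thread does not confirm: that the page is served in a legacy Chinese encoding such as GBK/GB18030 (the page's own charset declaration would need checking), and that Stata 14 or later is available, since ustrfrom() and Unicode support postdate this thread. The name gb_to_utf8() is hypothetical.

```stata
mata:
// Hypothetical helper: re-encode one line from GB18030 to UTF-8.
// ASSUMPTION: the source bytes really are GBK/GB18030; requires Stata 14+.
string scalar gb_to_utf8(string scalar s)
{
    // mode 1: invalid byte sequences are replaced by U+FFFD
    return(ustrfrom(s, "gb18030", 1))
}
end
```

Each line returned by fget() could be passed through such a function before it is displayed or stored. If the page turns out to be UTF-8 already, no conversion is needed in a Unicode-aware Stata.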
>>>>>
>>>>> Best
>>>>>
>>>>> Chuntao
>>>>> *
>>>>> * For searches and help try:
>>>>> * http://www.stata.com/help.cgi?search
>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>> * http://www.ats.ucla.edu/stat/stata/