Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: reading HTML source in Chinese but get a messy code
From: Sergiy Radyakin <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: reading HTML source in Chinese but get a messy code
Date: Mon, 10 Jun 2013 02:14:04 -0400

following statement in the documentation: "str# variables require #
bytes per observation". Sergiy.
On Mon, Jun 10, 2013 at 1:06 AM, Sergiy Radyakin <[email protected]> wrote:
> Tony,
>
> if your choice of package is based solely on whether it supports
> Unicode, I would probably recommend Microsoft's Excel or
> OpenOffice's Calc. However, since you are on this forum, you probably
> intend to do some statistical processing of that information. In that
> case, what is that analysis? In many cases you don't actually need to
> see the text, but can just rely on the package to handle it. If you don't
> want to specify, consider SPSS or SAS, which (according to the
> manufacturers) both support Unicode. I have also asked whether the new
> Stata 13 supports Unicode and hope for the best. If you want to harvest
> information about the user profiles, you will need to check with the
> site owner whether this is permitted; if you have a valid
> scientific need for it, the owner might simply pass that
> information to you in an organized way. From what appears on this
> page, you can pull some information such as ID, age, gender, phone, and
> province; the rest (name, address) is hardly of any value for
> statistical processing.
>
> Best, Sergiy
>
> On Sat, Jun 8, 2013 at 9:53 PM, Li Chuntao (Tony) <[email protected]> wrote:
>> Dear Sergiy,
>>
>> Thank you for your advice. Actually, I need the whole lines of
>> information from lines 2 to 9. Maybe Stata just cannot handle it because
>> of the Unicode problem. If you know of any package that can do it, please
>> advise.
>>
>> thanks again
>>
>> Tony
>>
>>
>> On Sun, Jun 9, 2013 at 12:09 AM, Sergiy Radyakin <[email protected]> wrote:
>>> Tony, after visiting the link I see Chinese characters in lines 2-9.
>>> Stata will not show you these characters because Stata does
>>> not work with Unicode. To see the file through Stata's eyes, open the
>>> link you posted in Firefox, then go to the menu View-->Character
>>> Encoding-->More Encodings-->West European-->Western (Windows-1252).
>>> This is what you can import into Stata and, yes, it does look messy.
>>> This is the best you can get with it. The good news is that if you
>>> process your data in Stata and then output the same messy text, you
>>> will end up with perfectly readable text, but readable elsewhere (e.g. in
>>> Notepad or a browser). In short, if your analysis requires, e.g.,
>>> searching for a substring in a text, you can do it by searching for
>>> byte sequences, though those sequences will not look intuitive at all.
>>> But if it is something more involved, then you might want to rethink
>>> your choice of package. Perhaps if you describe the broad
>>> goal of what you are doing, it would be easier to advise.
>>> Best, Sergiy
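
The byte-level behaviour described in the message above can be reproduced outside Stata. The following is only a minimal Python sketch, assuming the page is UTF-8 encoded; the sample strings and the helper name find_bytes are illustrative and do not come from the thread.

```python
# Assumption: the text is UTF-8 encoded; the sample strings are made up.
text = "\u4e2d\u6587"            # two Chinese characters
raw = text.encode("utf-8")       # the bytes a byte-oriented program reads

# Viewing the same bytes as Windows-1252 gives the "messy" text.
# Note: a few byte values are undefined in cp1252, so this decode can
# fail for other input; latin-1 is a lossless alternative in that case.
messy = raw.decode("cp1252")

# Writing the messy text back out with the same encoding restores the
# original bytes, so the result is readable again in a Unicode-aware viewer.
restored = messy.encode("cp1252").decode("utf-8")


def find_bytes(line: bytes, needle: str, encoding: str = "utf-8") -> int:
    """Return the byte offset of needle in line, or -1 if it is absent."""
    return line.find(needle.encode(encoding))


# A line as it would come from the file: label, colon, space, then the text.
line = "\u59d3\u540d: \u4e2d\u6587".encode("utf-8")
pos = find_bytes(line, text)

print(raw.hex(" "))              # the byte sequence to search for
print(restored == text)          # round trip preserved the text
print(pos)                       # byte offset, not character offset
```

Each of these Chinese characters occupies three bytes in UTF-8, which is why offsets and lengths counted in bytes differ from those counted in characters.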
>>>
>>> On Sat, Jun 8, 2013 at 11:20 AM, Li Chuntao (Tony) <[email protected]> wrote:
>>>> Dear Prof. Nick,
>>>>
>>>> Lines 2 to 9 are what I want, from the page at
>>>> http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html
>>>>
>>>> thanks
>>>>
>>>> Tony
>>>>
>>>>
>>>> On Sat, Jun 8, 2013 at 11:08 PM, Nick Cox <[email protected]> wrote:
>>>>> OK.
>>>>>
>>>>> Looking at the file in a text editor shows that alternate lines are
>>>>> blank. I don't know which lines are data for you.
>>>>> Nick
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On 8 June 2013 16:04, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>> Yes, of course I understand my own code. Here I just wanted to display
>>>>>> the first two lines to show that the output is messy and to ask for
>>>>>> help.
>>>>>>
>>>>>> Thank you, Nick, for your always kind help.
>>>>>>
>>>>>> Tony
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <[email protected]> wrote:
>>>>>>> Your own code doesn't seem well matched to the input. In your first
>>>>>>> post you were looping over the lines of the file, reading them one by
>>>>>>> one and then processing them. You have abandoned that here. Do you
>>>>>>> understand what the original Mata code does?
>>>>>>> Nick
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>> On 8 June 2013 15:51, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>>>> Well, it still does not work, as can be seen from the output of the
>>>>>>>> code below.
>>>>>>>>
>>>>>>>> I mean, the output from http://html2text.theinfo.org seems quite
>>>>>>>> clean, but it turns into a mess when I try to read it into Stata,
>>>>>>>> whether with insheet using or with the Mata code that follows.
>>>>>>>>
>>>>>>>> Does anyone have experience with this?
>>>>>>>>
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>>> Chuntao
>>>>>>>>
>>>>>>>>
>>>>>>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" ///
>>>>>>>>     "d:\temp.txt", replace
>>>>>>>> mata:
>>>>>>>> fh = fopen("d:\temp.txt", "r")
>>>>>>>> // read and display the first line
>>>>>>>> junk=fget(fh)
>>>>>>>> junk
>>>>>>>> // read and display the second line
>>>>>>>> junk=fget(fh)
>>>>>>>> junk
>>>>>>>> fclose(fh)
>>>>>>>> end
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <[email protected]> wrote:
>>>>>>>>> Chuntao,
>>>>>>>>>
>>>>>>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>>>>>>> as this is a pretty standard task. For your purposes the following
>>>>>>>>> should yield a pretty clean file:
>>>>>>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>>>>>>
>>>>>>>>> where you supply your URL as a parameter.
>>>>>>>>>
>>>>>>>>> Best, Sergiy Radyakin
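
Supplying the page address "as a parameter" means percent-encoding it, which is where the %3A and %2F in the link above come from. A minimal Python sketch of that step follows; the function name html2text_url is hypothetical, and nothing here checks whether the service is still available.

```python
from urllib.parse import quote

# Service address as quoted in the thread.
SERVICE = "http://html2text.theinfo.org/"


def html2text_url(page_url: str) -> str:
    """Build the service URL with page_url passed as the 'url' parameter."""
    # strip() drops stray surrounding spaces, which would otherwise be
    # encoded into the query string.
    # safe="" forces ":" and "/" to be escaped as %3A and %2F.
    return SERVICE + "?url=" + quote(page_url.strip(), safe="")


url = html2text_url("http://qq.ico.la/qq459322466.html")
print(url)
```

The resulting string can then be handed to whatever tool downloads the file.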
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <[email protected]> wrote:
>>>>>>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>>>>>>
>>>>>>>>>> A more fundamental point is that this is HTML:
>>>>>>>>>>
>>>>>>>>>> 1. So, lines will necessarily include HTML markup code in many if not
>>>>>>>>>> all lines. You will need to strip those too, or interpret them.
>>>>>>>>>>
>>>>>>>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>>>>>>>
>>>>>>>>>> In this particular case, there are many references to yet other files,
>>>>>>>>>> perhaps not of concern to you.
>>>>>>>>>>
>>>>>>>>>> I can't read Chinese, so that is as far as I go.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> [email protected]
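
Stripping markup, as the message above recommends, is easier with a parser than with line counting. The sketch below uses Python's standard html.parser module; the class name TextExtractor, the function html_to_lines, and the sample HTML are made up for illustration and are not taken from the page under discussion.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring tags and script/style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0          # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside script/style.
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def html_to_lines(html: str) -> list:
    """Return the text fragments of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample; the values are placeholders, not real profile data.
sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>ID: <b>12345</b></p><script>var x=1;</script><p>Age: 30</p>"
    "</body></html>"
)
print(html_to_lines(sample))
```

Because the parser tracks state across the whole document, it does not depend on which line a given piece of text happens to fall on.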
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>>>>>>> Dear Listers,
>>>>>>>>>>>
>>>>>>>>>>> I want to import the following HTML source file:
>>>>>>>>>>>
>>>>>>>>>>> http://qq.ico.la/qq459322466.html
>>>>>>>>>>>
>>>>>>>>>>> The source file contains some information in Chinese, which is
>>>>>>>>>>> located in lines 32 to 73.
>>>>>>>>>>>
>>>>>>>>>>> I tried to import the information using the following code:
>>>>>>>>>>>
>>>>>>>>>>> clear all
>>>>>>>>>>> set obs 500
>>>>>>>>>>> copy "http://qq.ico.la/qq459322466.html" "d:\qq.txt", replace
>>>>>>>>>>>
>>>>>>>>>>> mata:
>>>>>>>>>>> fh = fopen("d:\qq.txt", "r")
>>>>>>>>>>> // skip the first 34 lines
>>>>>>>>>>> for(i=1; i<=34; i++) {
>>>>>>>>>>>     junk=fget(fh)
>>>>>>>>>>> }
>>>>>>>>>>> // read and display the next 20 lines
>>>>>>>>>>> for(i=1; i<=20; i++) {
>>>>>>>>>>>     junk=fget(fh)
>>>>>>>>>>>     junk
>>>>>>>>>>> }
>>>>>>>>>>> fclose(fh)
>>>>>>>>>>> end
>>>>>>>>>>>
>>>>>>>>>>> but the resulting data in memory are just a mess.
>>>>>>>>>>>
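
For comparison, reading a fixed range of lines with an explicit encoding can be sketched in Python as follows. The function name read_line_range and the demonstration file are hypothetical, and the encoding is an assumption: Chinese pages are often UTF-8 or GBK, so the charset the page declares should be checked first.

```python
import os
import tempfile
from itertools import islice


def read_line_range(path, first, last, encoding="utf-8"):
    """Return lines first..last (1-based, inclusive) decoded with encoding."""
    # errors="replace" keeps going if a byte sequence is invalid for the
    # assumed encoding instead of raising an exception.
    with open(path, encoding=encoding, errors="replace") as fh:
        return [line.rstrip("\n") for line in islice(fh, first - 1, last)]


# Demonstration on a small made-up file written as GBK.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wb") as out:
        out.write("header\n\u4e2d\u6587\nfooter\n".encode("gbk"))
    demo_lines = read_line_range(demo_path, 2, 3, encoding="gbk")

print(len(demo_lines))
```

Decoding at read time is what lets the text display correctly; a program that treats the file as single-byte characters shows the raw bytes instead.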
>>>>>>>>>>> Similar code has been used for another web page, thanks to Prof. Kit
>>>>>>>>>>> Baum, as shown below:
>>>>>>>>>>>
>>>>>>>>>>> clear all
>>>>>>>>>>> set obs 500
>>>>>>>>>>> local stkcd="000002"
>>>>>>>>>>> gen str20 date="2012.12.31"
>>>>>>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" ///
>>>>>>>>>>>     "d:\date.txt", replace
>>>>>>>>>>> mata:
>>>>>>>>>>> fh = fopen("d:\date.txt", "r")
>>>>>>>>>>> for(i=1; i<=444; i++) {
>>>>>>>>>>>     junk=fget(fh)
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>>
>>>>>>>>>>> Chuntao
>>>>>>>>>>> *
>>>>>>>>>>> * For searches and help try:
>>>>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>> * http://www.ats.ucla.edu/stat/stata/