Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: reading HTML source in Chinese but get a messy code |
Date | Sat, 8 Jun 2013 16:08:42 +0100 |
OK. Looking at the file in a text editor shows that alternate lines are blank. I don't know which lines are data for you.

Nick
njcoxstata@gmail.com

On 8 June 2013 16:04, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
> Yes, of course I understand my own code. Here I just want to display
> the first two lines to show that the output is a mess, and to ask
> for help.
>
> Thank you, Nick, for your always kind help.
>
> Tony
>
> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Your own code doesn't seem well matched to the input. In your first
>> post you were looping over the lines of the file, reading them one by
>> one and then processing them. You have abandoned that here. Do you
>> understand what the original Mata code does?
>> Nick
>> njcoxstata@gmail.com
>>
>> On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>> Well, it still does not work, as can be seen from the output of the
>>> following code.
>>>
>>> I mean, the output from http://html2text.theinfo.org seems quite
>>> clean, but it turns into a mess when I try to read it into Stata,
>>> whether by -insheet using- or by the Mata code that follows.
>>>
>>> Does anyone have experience with this?
>>>
>>> Thanks
>>>
>>> Chuntao
>>>
>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
>>> mata:
>>> fh = fopen("d:\temp.txt", "r")
>>> junk=fget(fh)
>>> junk
>>> junk=fget(fh)
>>> junk
>>> end
>>>
>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>>> Chuntao,
>>>>
>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>> as this is a pretty standard task. For your purposes the following
>>>> should yield a pretty clean file:
>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>
>>>> where you supply your URL as a parameter.
>>>>
>>>> Best, Sergiy Radyakin
>>>>
>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>
>>>>> A more fundamental point is that this is HTML:
>>>>>
>>>>> 1. So, many if not all lines will necessarily include HTML markup
>>>>> code. You will need to strip that too, or interpret it.
>>>>>
>>>>> 2. Markup code won't necessarily be interpretable if you ignore previous lines.
>>>>>
>>>>> In this particular case, there are many references to yet other files,
>>>>> perhaps not of concern to you.
>>>>>
>>>>> I can't read Chinese, so that is as far as I go.
>>>>>
>>>>> Nick
>>>>> njcoxstata@gmail.com
>>>>>
>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>> Dear Listers,
>>>>>>
>>>>>> I want to import the following HTML source file:
>>>>>>
>>>>>> http://qq.ico.la/qq459322466.html
>>>>>>
>>>>>> The source file contains some information in Chinese, which is
>>>>>> located in lines 32 to 73.
>>>>>>
>>>>>> I tried to import the information by using the following code:
>>>>>>
>>>>>> clear all
>>>>>> set obs 500
>>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>
>>>>>> mata:
>>>>>> fh = fopen("d:\qq.txt", "r")
>>>>>> for(i=1; i<=34; i++) {
>>>>>> junk=fget(fh)
>>>>>> }
>>>>>> for(i=; i<=20; i++) {
>>>>>> junk=fget(fh)
>>>>>> junk
>>>>>> }
>>>>>>
>>>>>> end
>>>>>>
>>>>>> but the resulting data in memory are just a mess.
>>>>>>
>>>>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>>>>> Baum, as can be seen in the following:
>>>>>>
>>>>>> clear all
>>>>>> set obs 500
>>>>>> local stkcd="000002"
>>>>>> gen str20 date="2012.12.31"
>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>>>> mata:
>>>>>> fh = fopen("d:\date.txt", "r")
>>>>>> for(i=1; i<=444; i++) {
>>>>>> junk=fget(fh)
>>>>>> }
>>>>>> end
>>>>>>
>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Chuntao

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/
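
The line-by-line approach discussed in this thread (loop over the file, skip the blank lines, strip the HTML markup) might be sketched in Mata roughly as follows. This is only a sketch under assumptions: the function name read_nonblank() is hypothetical, the path d:\temp.txt is the one used earlier in the thread, and the tag pattern is a crude one that will miss tags spanning several lines. It does nothing about the character encoding of the file, which may well be the real cause of the unreadable output.

```stata
mata:
// Hypothetical helper (not from the thread): read a text file line by
// line, remove simple HTML tags, and keep only the non-blank lines.
string colvector read_nonblank(string scalar path)
{
    real scalar      fh
    string matrix    line    // matrix, because fget() returns J(0,0,"") at end of file
    string colvector kept

    kept = J(0, 1, "")
    fh = fopen(path, "r")
    while ((line = fget(fh)) != J(0, 0, "")) {
        // regexr() replaces only the first match, so repeat until no tag is left
        while (regexm(line, "<[^>]*>")) {
            line = regexr(line, "<[^>]*>", "")
        }
        line = strtrim(line)
        if (line != "") kept = kept \ line
    }
    fclose(fh)
    return(kept)
}

// Assumed path, as in the thread
lines = read_nonblank("d:\temp.txt")
rows(lines)
end
```

Reading to end of file, rather than skipping a fixed number of lines, avoids having to guess where the junk stops; which of the remaining lines are data still has to be decided by inspection.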