
Re: st: Insheeting Japanese

From	"Dan Weitzenfeld" <[email protected]>
To	[email protected]
Subject	Re: st: Insheeting Japanese
Date	Tue, 23 Sep 2008 12:01:04 -0700

Thanks all, VERY helpful.
I am going to take a crack at parsing it.  I'll post the results for posterity.

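For posterity, here is a minimal sketch of the "convert in another program" route suggested below, written in Python rather than Stata. It assumes the file is UTF-16 with a byte-order mark (the þÿ at the start of the garbled -insheet- output is consistent with a big-endian mark), that fields are separated by plain commas, and that the first field never contains a comma. The function name and the sample data are hypothetical, not taken from the real file.

```python
import csv
import io


def extract_second_field(raw: bytes) -> list[str]:
    """Decode UTF-16 bytes and return the second comma-separated field of each line."""
    # The "utf-16" codec reads the byte-order mark to pick the endianness
    # and drops the mark from the decoded text.
    text = raw.decode("utf-16")
    # newline="" leaves \r\n untouched so the csv module can split lines itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    # Skip blank or short lines rather than failing on them.
    return [row[1] for row in reader if len(row) > 1]


# Small in-memory sample (assumed layout): big-endian BOM followed by one line.
sample = b"\xfe\xff" + "あなたはこのＣＭが好きですか？,0,とても好き\r\n".encode("utf-16-be")
print(extract_second_field(sample))  # ['0']
```

Writing the returned values one per line to a plain ASCII file should give something that -insheet- or -infile- can read directly.
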
On Tue, Sep 23, 2008 at 11:55 AM, Sergiy Radyakin
<[email protected]> wrote:
> Right. Insheet is for ASCII (text) data only.
>
> With 2-byte codes, the second byte can be one that has a special
> (control) meaning in ASCII (e.g., field separator or end of line), and
> this will confuse -insheet-.
>
> Joseph Coveney gave a presentation on using ODBC and datasets with
> Unicode, "Working with ODBC data sources in Stata--tips and techniques".
> If you have access to it, you may find some tips there. (I don't have
> it, but I'd love to see it myself.) Here is a link to the abstract:
> http://ideas.repec.org/p/boc/asug04/10.html
>
> Perhaps you could just configure an ODBC data source and read your data
> from there, rather than parsing UTF-16 yourself (which should not be
> very difficult anyway). Most importantly, you need to know whether one
> character is always two bytes (according to
> http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is
> always 2, determine which 2 bytes stand for the field separator, read a
> line, skip until you meet the separator, start reading data and stop
> when you hit the next separator, decode/output the number, and discard
> the rest of the line.
>
> Best regards,
>    Sergiy Radyakin
>
> On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <[email protected]> wrote:
>> Dan Weitzenfeld :
>> Stata's -file- command can deal with this file; see -help file- for
>> examples of writing a loop to process a file.  But converting in
>> another program, then using -infile- or -insheet-, is likely easier.
>> The optimal approach depends on how often you will face this situation
>> again in future...
>>
>> On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels
>> <[email protected]> wrote:
>>> Dan, I don't know if Stata can read Unicode.  The -help- for -insheet-
>>> states it is for ASCII text.  One possibility: use a text editor to add
>>> double quotes (") at the beginning and end of lines and on either side of
>>> the commas. This may read everything as character.  Then convert only
>>> the variable you want back to real.
>>>
>>> -Steve
>>>
>>> On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:
>>>
>>>> I've been informed that the files are written in Unicode, UTF-16.  Can
>>>> Stata read this?
>>>>
>>>> On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld
>>>> <[email protected]> wrote:
>>>>>
>>>>> Thanks Sergiy, I did not know about that command.  Below is a line
>>>>> from my hexdump:
>>>>>
>>>>>   130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....
>>>>>
>>>>> I also noticed this when I ran -hexdump- with the analyze option:
>>>>>
>>>>>  Line-end characters
>>>>>   \r\n         (Windows)             0
>>>>>   \r by itself (Mac)                  5
>>>>>   \n by itself (Unix)                 5
>>>>>
>>>>> which looks suspicious to me.   I'll talk to the tech guys who made this
>>>>> file.
>>>>> Thanks again Sergiy.
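
The hexdump line above can be decoded to check what it contains. The following is a minimal Python sketch, assuming the dump shows big-endian 16-bit units (consistent with 002c standing for the comma); the variable names are illustrative only.

```python
# Bytes from the hexdump line, written as big-endian 16-bit units.
units = "304b ff1f 002c 0031 002c 0032 000d 000a"
raw = bytes.fromhex(units.replace(" ", ""))

# Decoded as UTF-16 big-endian, the line is ordinary comma-separated text.
decoded = raw.decode("utf-16-be")
print(repr(decoded))  # 'か？,1,2\r\n'

# At the byte level, \r (0d) and \n (0a) are separated by a 00 byte, so a
# byte-oriented tool never sees the two-byte sequence \r\n.
print(b"\r\n" in raw)  # False
print(raw.count(b"\r"), raw.count(b"\n"))  # 1 1
```

If the whole file is laid out this way, that would explain why the analyze output reports equal counts of \r and \n "by themselves" and no \r\n pairs: each line ending is stored as 00 0d 00 0a.
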
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Dear Dan,
>>>>>>
>>>>>> How data "looks" depends on which software "looks" at it. From
>>>>>> what I see in your message, there is a double-byte encoding of letters,
>>>>>> which may cause a problem.
>>>>>>
>>>>>> I suggest you first "look" at your data byte by byte to find the
>>>>>> pattern you need, then filter your data based on that pattern.
>>>>>> Use
>>>>>>  -hexdump- filename
>>>>>> to see how your data is structured. Check that you are using the correct
>>>>>> separator ("comma" and not "tab"), that the "comma" in your file is indeed a
>>>>>> standard ASCII "comma" and not some weird two-byte comma, that the
>>>>>> "comma" byte (44) is not used for encoding other characters, etc.
>>>>>>
>>>>>> Perhaps you could post a portion of output from hexdump here if this
>>>>>> does not contradict any rules of the list.
>>>>>>
>>>>>> Regards, Sergiy Radyakin
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> Quick but strange question.  I'm trying to insheet a comma-delimited
>>>>>>> file with Japanese in it.  For example, the first line looks like:
>>>>>>>
>>>>>>> あなたはこのＣＭが好きですか？,0,とても好き
>>>>>>>
>>>>>>> The only information I need is the second variable, the 0, which will
>>>>>>> always be numeric.
>>>>>>>
>>>>>>> However, when I insheet the file, I get nonsense:
>>>>>>>
>>>>>>> þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ                 0h0f0‚Y}0M
>>>>>>>
>>>>>>> which would be okay, except that the second variable always comes in as
>>>>>>> blank.
>>>>>>>
>>>>>>> Does anyone know of a solution for this?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Dan
>>>>>>>
>>>>>>> *
>>>>>>> *   For searches and help try:
>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>> *   http://www.stata.com/support/statalist/faq
>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>>>>
