Right. -insheet- is for ASCII (text) data only.
With 2-byte codes, either byte of a character can be one that has a
special (control) meaning in ASCII (e.g. field separator or end of
line), and this will confuse -insheet-.
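As a small illustration, here is a sketch in Python (my assumption for
a tool outside Stata; nothing in this thread uses it). It takes the 16
bytes from the hexdump line quoted below and decodes them as big-endian
UTF-16. Big-endian is inferred from the þÿ (FE FF) mark at the start of
the -insheet- output, so treat that as an assumption too.

```python
# Bytes copied from the hexdump line quoted further down in this thread.
raw = bytes.fromhex("304b ff1f 002c 0031 002c 0032 000d 000a")

# Decoded as big-endian UTF-16, the bytes are ordinary comma-separated text.
text = raw.decode("utf-16-be")
print(repr(text))

# Read byte by byte, as an ASCII-only reader would, every ASCII character
# is preceded by a zero byte, and CR and LF are no longer adjacent.
print("zero bytes:", raw.count(0))
print("CR LF pairs:", raw.count(b"\r\n"))
print("lone CR:", raw.count(b"\r"), "lone LF:", raw.count(b"\n"))
```

The zero byte sitting between 0D and 0A would be consistent with the
Analyze report quoted below, which finds no Windows line ends but equal
counts of bare \r and \n.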
Joseph Coveney gave a presentation on using ODBC and datasets with
Unicode, "Working with ODBC data sources in Stata--tips and techniques".
If you have access to it, you may find some tips there. (I don't have
it, but I'd love to see it myself.) Here is a link to the abstract:
http://ideas.repec.org/p/boc/asug04/10.html
Perhaps you could just configure an ODBC data source and read your data
from there, rather than parsing UTF-16 yourself (which should not be
very difficult anyway). Most importantly, you need to know whether one
character is always two bytes (according to
http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is
always 2, determine which 2 bytes stand for the field separator, read a
line, skip until you meet the separator, start reading data and stop
when you hit the next separator, decode/output the number, and discard
the rest of the line.
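
The steps above can be sketched outside Stata as follows. This is a
minimal Python sketch under stated assumptions, not a tested solution
for Dan's file: the function name extract_second_field is made up for
illustration, the file is assumed to start with a byte-order mark, and
the second field is assumed to be numeric on every line. Rather than
matching separator bytes by hand, it lets Python's "utf-16" codec deal
with the 2-versus-4-byte question.

```python
import csv
import io


def extract_second_field(raw, encoding="utf-16"):
    """Return the second comma-separated field of each line as a float.

    `raw` holds the file contents as bytes. The "utf-16" codec uses the
    byte-order mark to pick the byte order and also handles 4-byte
    (surrogate pair) characters.
    """
    text = raw.decode(encoding)
    # newline="" leaves line ends untouched so the csv module can handle them.
    rows = csv.reader(io.StringIO(text, newline=""))
    return [float(row[1]) for row in rows if len(row) > 1]


# Hypothetical two-line sample: FE FF byte-order mark plus big-endian text.
sample = b"\xfe\xff" + "\u3042,0,\u3044\r\n\u3046,1,\u3048\r\n".encode("utf-16-be")
print(extract_second_field(sample))
```

For a real file the bytes could come from open(path, "rb").read(), and
the resulting numbers could then be written to a plain ASCII file that
-insheet- can read.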
Best regards,
Sergiy Radyakin
On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <[email protected]> wrote:
> Dan Weitzenfeld :
> Stata's -file- command can deal with this file; see -help file- for
> examples of writing a loop to process a file. But converting in
> another program, then using -infile- or -insheet-, is likely easier.
> The optimal approach depends on how often you will face this situation
> again in future...
>
> On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels
> <[email protected]> wrote:
>> Dan, I don't know if Stata can read unicode. The -help- for -insheet-
>> states it is for ASCII text. One possibility; use a text editor to add
>> double quotes (") at the beginning and end of lines and on either side of
>> the commas. This may read everything as character. Then convert only
>> the variable you want back to real.
>>
>> -Steve
>>
>> On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:
>>
>>> I've been informed that the files are written in unicode, utf-16. Can
>>> Stata read this?
>>>
>>> On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld
>>> <[email protected]> wrote:
>>>>
>>>> Thanks Sergiy, I did not know about that command. Below is a line
>>>> from my hexdump:
>>>>
>>>> 130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....
>>>>
>>>> I also noticed this when I ran with option Analyze:
>>>>
>>>> Line-end characters
>>>> \r\n (Windows) 0
>>>> \r by itself (Mac) 5
>>>> \n by itself (Unix) 5
>>>>
>>>> which looks suspicious to me. I'll talk to the tech guys who made this
>>>> file.
>>>> Thanks again Sergiy.
>>>>
>>>>
>>>>
>>>> On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin
>>>> <[email protected]> wrote:
>>>>>
>>>>> Dear Dan,
>>>>>
>>>>> How the data "looks" depends on which software "looks" at it. From
>>>>> what I see in your message, the letters use a double-byte encoding,
>>>>> which may cause a problem.
>>>>>
>>>>> I suggest you first "look" at your data byte-by-byte, to find a
>>>>> pattern you need, then filter your data based on that pattern.
>>>>> Use
>>>>> -hexdump- filename
>>>>> to see how your data is structured. Check that you are using the
>>>>> correct separator ("comma" and not "tab"), that the "comma" in your
>>>>> file is indeed a standard ASCII comma and not some two-byte comma,
>>>>> that the "comma" byte (44) is not used in encoding other
>>>>> characters, etc.
>>>>>
>>>>> Perhaps you could post a portion of output from hexdump here if this
>>>>> does not contradict any rules of the list.
>>>>>
>>>>> Regards, Sergiy Radyakin
>>>>>
>>>>>
>>>>> On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>> Quick but strange question. I'm trying to insheet a comma-delimited
>>>>>> file with Japanese in it. For example, the first line looks like:
>>>>>>
>>>>>> あなたはこのCMが好きですか?,0,とても好き
>>>>>>
>>>>>> The only information I need is the second variable, the 0, which will
>>>>>> always be numeric.
>>>>>>
>>>>>> However, when I insheet the file, I get nonsense:
>>>>>>
>>>>>> þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ 0h0f0‚Y}0M
>>>>>>
>>>>>> which would be okay, except that the second variable always comes in as
>>>>>> blank.
>>>>>>
>>>>>> Does anyone know of a solution for this?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Dan
>>>>>>
>>>>>> *
>>>>>> * For searches and help try:
>>>>>> * http://www.stata.com/help.cgi?search
>>>>>> * http://www.stata.com/support/statalist/faq
>>>>>> * http://www.ats.ucla.edu/stat/stata/
>>>>>>