[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: insheet delimiter problem

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	RE: st: insheet delimiter problem
Date	Mon, 10 Nov 2008 12:59:33 -0000

Utilities like sed are a good idea; as Neil says, they have been ported
to Windows too (GNU project as well as the sources he cites). 

But check out -filefilter- in Stata. 

[D]     filefilter  . . . . .  Convert ASCII text or binary patterns in
a file
        (help filefilter)

FAQ     . . . . . . . . . . . . . . . . . . . . Malformed end-of-line
sequence
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.
Hassell
        12/03   Why do I get rows of missing data when I use infile?
                http://www.stata.com/support/faqs/data/miss_data.html

SJ-8-2  pr0039  . Stata tip 60: Fast and easy changes to files with
filefilter
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  A. R.
Riley
        Q2/08   SJ 8(2):290--292                                 (no
commands)
        tip on how to make changes to a file using the
        filefilter command

I would pre-process the file so that double quotes were edited to
something else. The character @ is often a good candidate. 

You can check with -hexdump- which characters are used in the file. 

The FAQ and Stata Tip give detailed examples. 

Nick 
[email protected] 

Neil Shephard

Ada Ma wrote:
> Thanks for the reply.  Here is an example I have created which is
> close to what happened.  The data should look like this:
>
> epikey	hrg	        code1	code2	code3
> 1	        A0123	D100  	V123	        K166
> 2	        A0125	D200	        "	        G122
> 3	        B0101       D300    	"	        C333
> 4	        B0122	D400	        E002	        V777
>
> It is pipe delimited so in the text file it looks like this:
>
> epikey|hrg|code1|code2|code3
> 1|A0123|D100|V123|K166
> 2|A0125|D200|"|G122
> 3|B0101|D300|"|C333
> 4|B0122|D400|E002|V777
>
> When I specified the command as you stated above, i.e. specifying the
> delim("|") option, Stata reads in this:
>
> epikey	hrg	        code1	code2
code3
> 1	        A0123	D100  	V123
K166
> 2	        A0125	D200	        |G1223|B0101|D300|
C333
> 4	        B0122	D400	        E002
V777
>
> So everything between the double quotes are treated as one string.  Is
> there any way to get around this without editing the txt file?
>
>   
Hmm, that is problematic, and not quite what I'd expect, but I can see
clearly why its happening.  Stata sees the first double quote and
assumes that it is encapsulating a string variable, and reads until it
sees the next (closing) string variable, treating any pipes ("|") as
part of the string.

I'm not sure how to work around this in Stata I'm afraid.  You may gain
some mileage writing a custom dictionary and using -infile-.

Personally I would make a system call to the common *NIX-like command
'sed' to search and replace any instances of double-quotes.  This has
the advantage of being automated as the system call can be placed in
your do-file (as opposed to manually opening the file in your text
editor and doing the search and replace).  At the same time it has the
disadvantage of not being handled internally in Stata, making it
somewhat less platform neutral (would probably work fine on Linux and
Macs, but you'd have to have some trickery to call sed under a Cygwin
installation under Windows, I've done it in the past, but can't quote
remember the finer details).  There may be a similar command  (or indeed
native version of sed) under M$-windows Command Prompt, but I'm not
aware of it.

Another option would be to ask the people who sent you the data to
choose an alternative character/symbol/number for missing data (quite
why they chose double-quotes in the first place is a mystery only they
can answer as it has the potential mess things up, as you've found ,by
virtue of being the character used to encapsulate strings by many
databases and software).

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: insheet delimiter problem
  - From: "Ada Ma" <[email protected]>

References:
- st: insheet delimiter problem
  - From: "Ada Ma" <[email protected]>
- Re: st: insheet delimiter problem
  - From: Neil Shephard <[email protected]>
- Re: st: insheet delimiter problem
  - From: "Ada Ma" <[email protected]>
- Re: st: insheet delimiter problem
  - From: Neil Shephard <[email protected]>

Prev by Date: Re: st: insheet delimiter problem
Next by Date: Re: st: insheet delimiter problem
Previous by thread: Re: st: insheet delimiter problem
Next by thread: Re: st: insheet delimiter problem
Index(es):
- Date
- Thread