Utilities like sed are a good idea; as Neil says, they have been ported
to Windows too (GNU project as well as the sources he cites).
But check out -filefilter- in Stata.
[D] filefilter . . . . . Convert ASCII text or binary patterns in
a file
(help filefilter)
FAQ . . . . . . . . . . . . . . . . . . . . Malformed end-of-line
sequence
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . J.
Hassell
12/03 Why do I get rows of missing data when I use infile?
http://www.stata.com/support/faqs/data/miss_data.html
SJ-8-2 pr0039 . Stata tip 60: Fast and easy changes to files with
filefilter
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. R.
Riley
Q2/08 SJ 8(2):290--292 (no
commands)
tip on how to make changes to a file using the
filefilter command
I would pre-process the file so that double quotes were edited to
something else. The character @ is often a good candidate.
You can check with -hexdump- which characters are used in the file.
The FAQ and Stata Tip give detailed examples.
Nick
[email protected]
Neil Shephard
Ada Ma wrote:
> Thanks for the reply. Here is an example I have created which is
> close to what happened. The data should look like this:
>
> epikey hrg code1 code2 code3
> 1 A0123 D100 V123 K166
> 2 A0125 D200 " G122
> 3 B0101 D300 " C333
> 4 B0122 D400 E002 V777
>
> It is pipe delimited so in the text file it looks like this:
>
> epikey|hrg|code1|code2|code3
> 1|A0123|D100|V123|K166
> 2|A0125|D200|"|G122
> 3|B0101|D300|"|C333
> 4|B0122|D400|E002|V777
>
> When I specified the command as you stated above, i.e. specifying the
> delim("|") option, Stata reads in this:
>
> epikey hrg code1 code2
code3
> 1 A0123 D100 V123
K166
> 2 A0125 D200 |G1223|B0101|D300|
C333
> 4 B0122 D400 E002
V777
>
> So everything between the double quotes are treated as one string. Is
> there any way to get around this without editing the txt file?
>
>
Hmm, that is problematic, and not quite what I'd expect, but I can see
clearly why its happening. Stata sees the first double quote and
assumes that it is encapsulating a string variable, and reads until it
sees the next (closing) string variable, treating any pipes ("|") as
part of the string.
I'm not sure how to work around this in Stata I'm afraid. You may gain
some mileage writing a custom dictionary and using -infile-.
Personally I would make a system call to the common *NIX-like command
'sed' to search and replace any instances of double-quotes. This has
the advantage of being automated as the system call can be placed in
your do-file (as opposed to manually opening the file in your text
editor and doing the search and replace). At the same time it has the
disadvantage of not being handled internally in Stata, making it
somewhat less platform neutral (would probably work fine on Linux and
Macs, but you'd have to have some trickery to call sed under a Cygwin
installation under Windows, I've done it in the past, but can't quote
remember the finer details). There may be a similar command (or indeed
native version of sed) under M$-windows Command Prompt, but I'm not
aware of it.
Another option would be to ask the people who sent you the data to
choose an alternative character/symbol/number for missing data (quite
why they chose double-quotes in the first place is a mystery only they
can answer as it has the potential mess things up, as you've found ,by
virtue of being the character used to encapsulate strings by many
databases and software).
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/