Friedrich Huebler wrote:
I work with tab-delimited text files whose string variables sometimes
contain quote marks. If the quotes appear in pairs, the data are
imported but the quotes are stripped. When a string contains a single
quote mark (i.e., a quote mark without a matching second quote mark),
Stata fills that variable up to the maximum string length of 244
characters and then stops the import, so all remaining data from the
original file are ignored. The problem can be reproduced with these
three test files:
test1.txt:
row11 row12 row13
row21 row22 row23
row31 row32 row33
test2.txt:
row11 row12 row13
row21 "row"22 row23
row31 row32 row33
test3.txt:
row11 row12 row13
row21 row"22 row23
row31 row32 row33
Each file has three lines of text, and each line has three strings
separated by tabs. test1.txt is a tab-delimited text file without
quotes; it imports without problems. test2.txt is a tab-delimited text
file with a pair of quotes; the file is imported but the quotes are
removed. test3.txt has a single quote mark, and -insheet- fails. One of
my text files has 95,000 lines, and only the first 918 lines are
imported because of a single quote mark in line 918.
[snip]
How can the data be imported into Stata with all observations and
preferably also with the quotes, whether single or paired? I can open
the files in a text editor, look for quotes that do not appear in
pairs, and remove them manually, but this is inefficient and changes
the original data.
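The failure mode described above can be modeled with a toy quote-pairing parser. The sketch below is an illustration in Python, not Stata's actual -insheet- logic: a quote mark toggles a quoted region, and inside that region tabs and newlines lose their delimiter status, so a lone quote swallows the rest of the file into one oversized field.

```python
def toy_quote_parser(text):
    """Toy model of quote-pairing (an illustration, NOT Stata's code):
    a quote mark toggles a quoted region; inside it, tabs and
    newlines lose their delimiter status."""
    fields, buf, in_quotes = [], "", False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # quotes themselves are stripped
        elif ch in "\t\n" and not in_quotes:
            fields.append(buf)
            buf = ""
        else:
            buf += ch
    if buf:
        fields.append(buf)
    return fields

test2 = 'row11\trow12\trow13\nrow21\t"row"22\trow23\nrow31\trow32\trow33\n'
test3 = test2.replace('"row"22', 'row"22')

# Paired quotes (test2): all nine fields come back, quotes stripped.
# A lone quote (test3): everything after it is merged into one field,
# mirroring the truncated import described above.
```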
--------------------------------------------------------------------------------
Steven Dubnoff already responded, and using Stat/Transfer is a good way
to go. I have Stat/Transfer, and it would be my first choice in your
situation. But if you don't happen to have Stat/Transfer, below are two
additional, Stata-only ways. I'm using your test files (the latter two,
which illustrate your problem).
Joseph Coveney
P.S. I highly recommend getting Stat/Transfer if you don't already have it.
If you're dealing with external datasets delivered to you and over which you
don't have control, as seems to be the case in your situation, then you'll
eventually need it, whether it be a SAS or SPSS dataset, or someone's poorly
laid out Excel spreadsheet.
clear *
set more off
*
* Second way
*
tempname file_handle
forvalues file = 2/3 {
    drop _all
    file open `file_handle' using test`file'.txt, read text
    file read `file_handle' line
    quietly set obs 1
    generate str a = `"`macval(line)'"'
    quietly while r(eof) == 0 {
        file read `file_handle' line
        set obs `=_N + 1'
        replace a = `"`macval(line)'"' in l
    }
    // note: the final read at end-of-file leaves one empty
    // observation, which shows up as a blank row in the listings
    file close `file_handle'
    // list, noobs
    quietly split a, generate(column) parse(`=char(9)')
    drop a
    list, noobs
}
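The loop above sidesteps quote processing entirely: each raw line is read into a string variable first, and only afterwards is it split on the tab character, so a quote mark is never given any delimiter role. A minimal Python sketch of the same read-then-split idea (recreating the post's test3.txt for a self-contained example):

```python
# Recreate test3.txt from the post, then import it the "raw" way.
with open("test3.txt", "w") as f:
    f.write('row11\trow12\trow13\n'
            'row21\trow"22\trow23\n'
            'row31\trow32\trow33\n')

# Read whole lines first, split on tab second; a quote mark is
# just another character, so nothing is swallowed or truncated.
with open("test3.txt") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
```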
*
* Third way
*
tempfile tmpfil0
forvalues file = 2/3 {
    quietly {
        // \Q is a double quote; \147d is the character with
        // decimal ASCII code 147, used here as a sentinel
        filefilter test`file'.txt `tmpfil0', ///
            from(\Q) to(\147d) replace
        insheet using `tmpfil0', nonames tab clear
        foreach var of varlist v* {
            replace `var' = subinstr(`var', char(147), ///
                char(34), .)
        }
    }
    list, noobs
}
exit
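The -filefilter- step swaps every double quote for character 147 (a byte unlikely to appear in the data) before -insheet- parses the file, and the -subinstr()- loop swaps it back afterwards. The same swap-and-restore idea, sketched in Python purely for illustration, with the csv module standing in for the quote-aware reader:

```python
import csv
import io

raw = 'row11\trow12\trow13\nrow21\trow"22\trow23\nrow31\trow32\trow33\n'
SENTINEL = chr(147)  # same stand-in byte the do-file uses via filefilter

# Step 1: replace every quote mark with the sentinel so the
# quote-aware reader never sees an unmatched quote.
masked = raw.replace('"', SENTINEL)

# Step 2: parse normally as tab-delimited data.
rows = list(csv.reader(io.StringIO(masked), delimiter="\t"))

# Step 3: restore the quote marks field by field.
rows = [[field.replace(SENTINEL, '"') for field in row] for row in rows]
```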
Output (redacted):
[do-file recitation snipped]
+-----------------------------+
| column1 column2 column3 |
|-----------------------------|
| row11 row12 row13 |
| row21 "row"22 row23 |
| row31 row32 row33 |
| |
+-----------------------------+
+-----------------------------+
| column1 column2 column3 |
|-----------------------------|
| row11 row12 row13 |
| row21 row"22 row23 |
| row31 row32 row33 |
| |
+-----------------------------+
[ditto]
+-------------------------+
| v1 v2 v3 |
|-------------------------|
| row11 row12 row13 |
| row21 "row"22 row23 |
| row31 row32 row33 |
+-------------------------+
+------------------------+
| v1 v2 v3 |
|------------------------|
| row11 row12 row13 |
| row21 row"22 row23 |
| row31 row32 row33 |
+------------------------+
. exit
end of do-file
.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/