Friedrich Huebler wrote:
I work with tab-delimited text files whose string variables sometimes
contain quote marks. If the quotes appear in pairs, the data are
imported but the quotes are stripped. When a string contains a single
quote mark (i.e., a quote mark without a matching second quote mark),
Stata fills that variable up to the maximum string length of 244
characters and then stops the import, so all remaining data from the
original file are ignored. The problem can be reproduced with these
three test files:
test1.txt:
row11 row12 row13
row21 row22 row23
row31 row32 row33
test2.txt:
row11 row12 row13
row21 "row"22 row23
row31 row32 row33
test3.txt:
row11 row12 row13
row21 row"22 row23
row31 row32 row33
Each file has three lines of text, and each line has three strings
separated by tabs. test1.txt is a tab-delimited text file without
quotes; it imports without problems. test2.txt is a tab-delimited text
file with a pair of quotes; the file is imported but the quotes are
removed. test3.txt has a single quote mark, and -insheet- fails. One of
my text files has 95,000 lines, and only the first 918 lines are
imported because of a single quote mark in line 918.
[snip]
How can the data be imported into Stata with all observations and
preferably also with the quotes, whether single or paired? I can open
the files in a text editor, look for quotes that do not appear in
pairs, and remove them manually, but this is inefficient and changes
the original data.
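The failure mode described above can be modeled with a toy quote-pairing parser. The sketch below is an illustration in Python, not Stata's actual -insheet- logic: a quote mark toggles a quoted region, and inside that region tabs and newlines lose their delimiter status, so a lone quote swallows the rest of the file into one oversized field.

```python
def toy_quote_parser(text):
    """Toy model of quote-pairing (an illustration, NOT Stata's code):
    a quote mark toggles a quoted region; inside it, tabs and
    newlines lose their delimiter status."""
    fields, buf, in_quotes = [], "", False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # quotes themselves are stripped
        elif ch in "\t\n" and not in_quotes:
            fields.append(buf)
            buf = ""
        else:
            buf += ch
    if buf:
        fields.append(buf)
    return fields

test2 = 'row11\trow12\trow13\nrow21\t"row"22\trow23\nrow31\trow32\trow33\n'
test3 = test2.replace('"row"22', 'row"22')

# Paired quotes (test2): all nine fields come back, quotes stripped.
# A lone quote (test3): everything after it is merged into one field,
# mirroring the truncated import described above.
```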
--------------------------------------------------------------------------------
Steven Dubnoff already responded, and using Stat/Transfer is a good way
to go. I have Stat/Transfer, and it would be my first choice in your
situation. But if you don't happen to have Stat/Transfer, below are two
additional, Stata-only ways. I'm using your test files (the latter two,
which illustrate your problem).
Joseph Coveney
P.S. I highly recommend getting Stat/Transfer if you don't already have it.
If you're dealing with external datasets delivered to you and over which you
don't have control, as seems to be the case in your situation, then you'll
eventually need it, whether it be a SAS or SPSS dataset, or someone's poorly
laid out Excel spreadsheet.
clear *
set more off
*
* Second way
*
tempname file_handle
forvalues file = 2/3 {
    drop _all
    file open `file_handle' using test`file'.txt, read text
    file read `file_handle' line
    quietly set obs 1
    generate str a = `"`macval(line)'"'
    quietly while r(eof) == 0 {
        file read `file_handle' line
        set obs `=_N + 1'
        replace a = `"`macval(line)'"' in l
    }
    // note: the final read at end-of-file leaves one empty
    // observation, which shows up as a blank row in the listings
    file close `file_handle'
    // list, noobs
    quietly split a, generate(column) parse(`=char(9)')
    drop a
    list, noobs
}
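The loop above sidesteps quote processing entirely: each raw line is read into a string variable first, and only afterwards is it split on the tab character, so a quote mark is never given any delimiter role. A minimal Python sketch of the same read-then-split idea (recreating the post's test3.txt for a self-contained example):

```python
# Recreate test3.txt from the post, then import it the "raw" way.
with open("test3.txt", "w") as f:
    f.write('row11\trow12\trow13\n'
            'row21\trow"22\trow23\n'
            'row31\trow32\trow33\n')

# Read whole lines first, split on tab second; a quote mark is
# just another character, so nothing is swallowed or truncated.
with open("test3.txt") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
```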
*
* Third way
*
tempfile tmpfil0
forvalues file = 2/3 {
    quietly {
        // \Q is a double quote; \147d is the character with
        // decimal ASCII code 147, used here as a sentinel
        filefilter test`file'.txt `tmpfil0', ///
            from(\Q) to(\147d) replace
        insheet using `tmpfil0', nonames tab clear
        foreach var of varlist v* {
            replace `var' = subinstr(`var', char(147), ///
                char(34), .)
        }
    }
    list, noobs
}
exit
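The -filefilter- step swaps every double quote for character 147 (a byte unlikely to appear in the data) before -insheet- parses the file, and the -subinstr()- loop swaps it back afterwards. The same swap-and-restore idea, sketched in Python purely for illustration, with the csv module standing in for the quote-aware reader:

```python
import csv
import io

raw = 'row11\trow12\trow13\nrow21\trow"22\trow23\nrow31\trow32\trow33\n'
SENTINEL = chr(147)  # same stand-in byte the do-file uses via filefilter

# Step 1: replace every quote mark with the sentinel so the
# quote-aware reader never sees an unmatched quote.
masked = raw.replace('"', SENTINEL)

# Step 2: parse normally as tab-delimited data.
rows = list(csv.reader(io.StringIO(masked), delimiter="\t"))

# Step 3: restore the quote marks field by field.
rows = [[field.replace(SENTINEL, '"') for field in row] for row in rows]
```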
Output (redacted):
[do-file recitation snipped]
+-----------------------------+
| column1 column2 column3 |
|-----------------------------|
| row11 row12 row13 |
| row21 "row"22 row23 |
| row31 row32 row33 |
| |
+-----------------------------+
+-----------------------------+
| column1 column2 column3 |
|-----------------------------|
| row11 row12 row13 |
| row21 row"22 row23 |
| row31 row32 row33 |
| |
+-----------------------------+
[ditto]
+-------------------------+
| v1 v2 v3 |
|-------------------------|
| row11 row12 row13 |
| row21 "row"22 row23 |
| row31 row32 row33 |
+-------------------------+
+------------------------+
| v1 v2 v3 |
|------------------------|
| row11 row12 row13 |
| row21 row"22 row23 |
| row31 row32 row33 |
+------------------------+
. exit
end of do-file
.
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/