Kevin Crow of StataCorp was kind enough to help me sort through this
issue. The problem was that there were binary characters hidden in the
header of the file, which caused both -insheet- and -infile- to have
problems.
The solution was to use -filefilter- to strip the binary characters
(hex 14 in my case) by replacing them with an empty string. I verified
the conversion using -hexdump, analyze-. -hexdump- also diagnosed the
original problem: one of its return values, r(format), is set to
"BINARY" if any binary characters are found in the source, and to
"ASCII" if the file is truly plain ASCII.
Finally, an infile dictionary turned out to be the best way to ignore
the header and read in my data with the proper column registrations. My
do-file now contains the following code as part of the loop accumulating
source data files:
-----------begin excerpt from datagather.do-------------
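* report the format of the raw source file
quietly hexdump "`filen'", analyze
di r(format)
* delete the hex-14 bytes; the cleaned copy is written to filtered.txt
quietly filefilter "`filen'" "filtered.txt", from(\14h) to("") replace
* confirm that the cleaned copy is now plain ASCII
quietly hexdump "filtered.txt", analyze
di r(format)
assert r(format)=="ASCII"
quietly {
* -clear- is an option of -infile-, so it stays on the same line
infile using datadict.dct in 1/28900, using("filtered.txt") clear
}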
-----------end excerpt from datagather.do ---------
The dictionary contains:
-----------begin datadict.dct-------------
infile dictionary {
_firstline(25)
int v1
int v2
float v3
float v4
int v5
}
-----------end datadict.dct-------------
Some take-home lessons from this effort were:
* Don't assume that just because someone tells you a text file
is ASCII that it's true! -hexdump- can verify.
* If you use _firstline() in your dictionary to skip a header,
and you want to select a range in your -infile- statement, have the
range start with 1.
* If you're going to specify storage types in your dictionary,
make sure they are large enough for the data you're importing! I
initially was using byte, and then was scratching my head over v1 and v2
coming in with missing values for anything larger than 100.
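A minimal sketch of the storage-type point, not taken from the do-file
above; the variable names raw and small are made up for illustration. A
byte can only hold integers from -127 to 100, so larger values are
stored as missing:
-----------begin sketch-------------
clear
set obs 3
generate int raw = 99 + _n      // 100, 101, 102
generate byte small = raw       // byte holds -127..100: 101 and 102 become missing
list raw small
count if missing(small)
-----------end sketch-------------
Declaring such columns as int (which holds values up to 32,740) or long
in the dictionary avoids the problem.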
I probably got lucky in that the only binary character in my files ended
up being \14h. I imagine that one could construct a loop going over the
-filefilter- command to increment across other ranges to eliminate a
wider variety of non-ASCII characters from the file. Fortunately,
-filefilter- appears to be very fast.
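The loop idea above might be sketched as follows. This is untested
against the original files and rests on several assumptions: the file
names source.txt and cleaned.txt are hypothetical, the list of codes
(control characters other than tab, line feed and carriage return) is
only one plausible choice, and -filefilter- is assumed to accept the
three-digit decimal form \###d. The pattern is built with char(92)
because Stata treats a backslash typed directly in front of a macro
quote as an escape that suppresses expansion.
-----------begin sketch-------------
* Sketch only: file names and the list of codes are assumptions.
local src "source.txt"                  // hypothetical input file
tempfile cur next
copy "`src'" "`cur'"

* control characters other than tab (9), LF (10) and CR (13)
foreach c of numlist 1/8 11 12 14/31 127 {
    * build the pattern, e.g. \020d, without typing a backslash next to a macro
    local pat = char(92) + string(`c', "%03.0f") + "d"
    * -filefilter- cannot overwrite its input, so alternate between two tempfiles
    quietly filefilter "`cur'" "`next'", from(`pat') to("") replace
    copy "`next'" "`cur'", replace
}

copy "`cur'" "cleaned.txt", replace     // hypothetical output file
quietly hexdump "cleaned.txt", analyze
di r(format)
-----------end sketch-------------
If the file also contains bytes above 127, r(format) may still not
report "ASCII" after this pass, so the final check is displayed rather
than asserted.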
Thanks again to Kevin and StataCorp for helping me solve the problem.
John Wallace
Research Associate
Affymetrix, Inc
[email protected]
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Wallace, John
Sent: Friday, October 26, 2007 2:27 PM
To: [email protected]
Subject: st: Infile errors
Hello statalisters
I have a class of text files that I would like to use as a data source.
I used -type- to verify that the data is ASCII. The majority of the
file is 5 columns of tab-delimited, numerical text. A single row
immediately before the data has single-word, tab-delimited header
information for the 5 columns, but I'm happy ignoring them if necessary.
Preceding the header line are 23 lines of text that I don't care about.
No tab characters occur in those 23 lines, however.
-insheet- is confused by the 23 lines and imports them as string
records in variable v1; it also creates string variables v2-v5, which
hold the 2nd through 5th columns of data, beginning at line 25.
-infile- would seem to be the appropriate command:
. infile a b c d e in 25/1325 using source.txt
but this doesn't work exactly right. It places the data from line 27
column 3 in a[1], which causes a registration shift in the dataset (and
the omission of the first 3 lines of data altogether). Explicitly:
---begin source.txt line 25---
0<T> 0<T>379.0<T>76.8<T> 64
1<T> 0<T>54.0<T>14.9<T> 64
2<T> 0<T>28.0<T>5.1<T> 64
3<T> 0<T>24.0<T>2.9<T> 64
4<T> 0<T>22.0<T>2.7<T> 64
5<T> 0<T>24.0<T>2.9<T> 64
6<T> 0<T>27.0<T>2.8<T> 64
7<T> 0<T>25.0<T>2.7<T> 64
---end source.txt line 32---
...is imported as
     |  a     b    c   d   e |
     |-----------------------|
  1. | 28   5.1   64   3   0 |
  2. | 24   2.9   64   4   0 |
  3. | 22   2.7   64   5   0 |
  4. | 24   2.9   64   6   0 |
  5. | 27   2.8   64   7   0 |
Is there another way? Infile with a dictionary seems like the next
step(?), but I'd like to know why specifying the record range with "in
25/1325" didn't work.
Thanks for any help
John Wallace
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/