Title | Malformed end-of-line sequences |
Author | James Hassell, StataCorp |
Sometimes when you read a file with Stata’s infile command, the resulting dataset contains alternating rows of missing values. For example, consider the following sample file, auto.raw, which includes its own dictionary:
    dictionary {
        str18 make      `"Make and Model"'
        int   price     `"Price"'
        int   mpg       `"Mileage (mpg)"'
        int   rep78     `"Repair Record 1978"'
        float headroom  `"Headroom (in.)"'
        int   trunk     `"Trunk space (cu. ft.)"'
        int   weight    `"Weight (lbs.)"'
    }
    "AMC Concord" 4099 22 3 2.5 11 2930
    "AMC Pacer" 4749 17 3 3.0 11 3350
    "AMC Spirit" 3799 22 . 3.0 12 2640
    "Buick Century" 4816 20 3 4.5 16 3250
    "Buick Electra" 7827 15 4 4.0 20 4080
    "Buick LeSabre" 5788 18 3 4.0 21 3670
You would probably expect that the data in the file could be read with Stata’s infile command and that, subsequently, the data in memory would contain 6 rows. However, after Stata reads the data, you might be surprised to find the following output:
    . infile using auto.raw, clear
    (output omitted)

    . list

         +-----------------------------------------------------------------+
         | make            price   mpg   rep78   headroom   trunk   weight |
         |-----------------------------------------------------------------|
      1. |                     .     .       .          .       .        . |
      2. | AMC Concord      4099    22       3        2.5      11     2930 |
      3. |                     .     .       .          .       .        . |
      4. | AMC Pacer        4749    17       3          3      11     3350 |
      5. |                     .     .       .          .       .        . |
         |-----------------------------------------------------------------|
      6. | AMC Spirit       3799    22       .          3      12     2640 |
      7. |                     .     .       .          .       .        . |
      8. | Buick Century    4816    20       3        4.5      16     3250 |
      9. |                     .     .       .          .       .        . |
     10. | Buick Electra    7827    15       4          4      20     4080 |
         |-----------------------------------------------------------------|
     11. |                     .     .       .          .       .        . |
     12. | Buick LeSabre    5788    18       3          4      21     3670 |
     13. |                     .     .       .          .       .        . |
         +-----------------------------------------------------------------+
At this point, you are probably wondering what has happened. The key to understanding this behavior is knowing which end-of-line (EOL) characters each platform uses. Stata can safely and accurately read raw data that have valid Windows, Mac, or Unix EOL markers. The unexpected behavior in the example above is caused by malformed EOL sequences in our test file (auto.raw). The valid EOL sequences for the three platforms are listed in the table below:
Platform | Characters | ASCII Codes |
---|---|---|
Mac | \n | 10 |
Unix | \n | 10 |
Windows | \r\n | 13 10 |
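As mentioned above, the file named auto.raw contains invalid EOL sequences: each line ends with "\r\r\n", which does not match any of the valid EOL sequences. In effect, the first \r is treated as a line ending on its own and the following \r\n as a second one, so each line in the file is followed by an empty line, which infile reads as an observation of missing values.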
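To experiment with this byte pattern without the original file, a small test file with the same line endings could be written with Stata's file command in binary mode. This is only a sketch: the filename bad_eol.raw and its two lines of content are hypothetical, and the bytes 13 and 10 are the ASCII codes for \r and \n from the table above.

```stata
* Sketch only: write a two-line file whose lines end in \r\r\n.
* bad_eol.raw is a hypothetical filename chosen for this illustration.
tempname fh
file open `fh' using bad_eol.raw, write binary replace
* %3s writes a 3-character string; %1bu writes one unsigned byte
file write `fh' %3s ("1 2") %1bu (13) %1bu (13) %1bu (10)
file write `fh' %3s ("3 4") %1bu (13) %1bu (13) %1bu (10)
file close `fh'
```

The resulting file can then be inspected with the tools described next.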
Stata has a command called hexdump, which can read and analyze raw binary data. Using hexdump with its analyze option displays some of the normally hidden attributes associated with a text file. For example,
    . hexdump auto.raw, analyze

      Line-end characters                      Line length (tab=1)
        \r\n         (Windows)           15      minimum                       1
        \r by itself (Old Mac)           15      maximum                      77
        \n by itself (Mac or Unix)        0
      Space/separator characters               Number of lines                30
        [blank]                         466      EOL at EOF?                 yes
        [tab]                             0
        [comma] (,)                       0    Length of first 5 lines
      Control characters                         Line 1                       13
        binary 0                          0      Line 2                        1
        CTL excl. \r, \n, \t              0      Line 3                       52
        DEL                               0      Line 4                        1
        Extended (128-159,255)            0      Line 5                       43
      ASCII printable
        A-Z                              28
        a-z                             174    File format                 ASCII
        0-9                              97
        Special (!@#$ etc.)              61
        Extended (160-254)                0
                            ---------------
        Total                           871

      Observed were:
         \n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b c d e f g h i k l m n o p r s t u w y { }
The output above shows 15 Windows EOL sequences (\r\n) and 15 carriage returns standing by themselves, which hexdump labels as old Mac EOL characters. We can use Stata’s filefilter command to strip out the unwanted carriage returns (i.e., the first \r in each \r\r\n sequence). For example,
    . filefilter auto.raw auto2.raw, from(\r\r) to(\r) replace

    . hexdump auto2.raw, analyze

      Line-end characters                      Line length (tab=1)
        \r\n         (Windows)           15      minimum                       2
        \r by itself (Old Mac)            0      maximum                      77
        \n by itself (Mac or Unix)        0
      Space/separator characters               Number of lines                15
        [blank]                         466      EOL at EOF?                 yes
        [tab]                             0
        [comma] (,)                       0    Length of first 5 lines
      Control characters                         Line 1                       13
        binary 0                          0      Line 2                       52
        CTL excl. \r, \n, \t              0      Line 3                       43
        DEL                               0      Line 4                       51
        Extended (128-159,255)            0      Line 5                       56
      ASCII printable
        A-Z                              28
        a-z                             174    File format                 ASCII
        0-9                              97
        Special (!@#$ etc.)              61
        Extended (160-254)                0
                            ---------------
        Total                           856

      Observed were:
         \n \r blank " ' ( ) . 0 1 2 3 4 5 6 7 8 9 A B C E H L M P R S T W ` a b c d e f g h i k l m n o p r s t u w y { }
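We can see that every "\r\r" sequence was replaced by a single "\r", so each line now ends with "\r\n": hexdump no longer reports any \r by itself, and the number of lines has dropped from 30 to 15. Our file now contains valid Windows EOL sequences.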
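The same check can be made in a do-file rather than by reading the table. The sketch below assumes the stored results that the hexdump documentation lists for the analyze option (r(Windows), r(Mac), and r(Unix)); confirm those names against the help file for your Stata version before relying on them.

```stata
* Sketch: verify the repaired file programmatically.
* Assumes hexdump, analyze leaves its counts in r().
quietly hexdump auto2.raw, analyze
local nwin  = r(Windows)
local nmac  = r(Mac)
local nunix = r(Unix)
display "Windows EOL sequences: `nwin'"
display "Lone carriage returns: `nmac'"
display "Lone line feeds:       `nunix'"
* Stop with an error if any stray carriage returns remain
assert `nmac' == 0
```

Copying the results into local macros first keeps them available even if a later command overwrites r().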
The new file, named auto2.raw, can now be read into Stata with its accompanying data dictionary by using the infile command. For example,
. infile using auto2.raw, clear (output omitted) . list +-----------------------------------------------------------------+ | make price mpg rep78 headroom trunk weight | |-----------------------------------------------------------------| 1. | AMC Concord 4099 22 3 2.5 11 2930 | 2. | AMC Pacer 4749 17 3 3 11 3350 | 3. | AMC Spirit 3799 22 . 3 12 2640 | 4. | Buick Century 4816 20 3 4.5 16 3250 | 5. | Buick Electra 7827 15 4 4 20 4080 | |-----------------------------------------------------------------| 6. | Buick LeSabre 5788 18 3 4 21 3670 | +-----------------------------------------------------------------+
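If Unix-style line endings are preferred, one alternative is to collapse the whole malformed sequence to a single \n in one pass. This is a sketch rather than part of the original example: the output filename auto3.raw is hypothetical, and r(occurrences) is the stored result in which filefilter is documented to report the number of replacements.

```stata
* Sketch: replace each \r\r\n with a bare \n (Unix-style EOL).
* auto3.raw is a hypothetical output filename.
filefilter auto.raw auto3.raw, from(\r\r\n) to(\n) replace
* Number of sequences that were replaced
display r(occurrences)
infile using auto3.raw, clear
list
```

Either repaired file should read identically, because Stata accepts both \r\n and \n as valid line endings.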