Title:  Reading fixed-format data with infile
Author: James Hardin, StataCorp
You need to use a dictionary to read fixed-format data into Stata. Creating a dictionary can be confusing if you get caught up in all the gory details, so we offer some advice that will handle most of the text files you encounter. Other special cases are addressed in the later examples.
The best advice for avoiding almost all the infile problems that you encounter when reading fixed-format files is to specify the following four things for every variable in the dictionary:

1. the column in which the field begins, using _column(#)
2. the storage type (byte, int, long, float, double, or str#)
3. the variable name
4. the read format (the %infmt), which tells Stata how many columns to read and whether to read them as numeric or string
Note: The most frequent cause of confusion for users is that the “storage type” has nothing to do with how a field is read from the text file. It affects only how that field is stored in Stata after it has been read. To control how the field is read from the text file, use the “read format”.

For example, specifying that a variable should be stored as type str25 does not mean that Stata should read 25 columns of information when it processes the text file; the read format (such as %25s) determines how many columns are read.
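To make the distinction concrete, consider the made-up dictionary fragment below (the file name example.raw and the variable name code are only for illustration). The %2s read format causes exactly two columns, columns 3 and 4, to be read; the str5 storage type says only that the result is stored in a str5 variable:

dictionary using example.raw {
        _column(3) str5 code %2s
}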
To read the data from the test1.raw file below,
1101
0111
1100
you can use the following dictionary in the file test1a.dct:
dictionary using test1.raw {
        _column(1) byte b1 %1f
        _column(2) byte b2 %1f
        _column(3) byte b3 %1f
        _column(4) byte b4 %1f
}
You could also use the dictionary in file test1b.dct:
dictionary using test1.raw {
        _column(2) byte b2 %1f
        _column(1) byte b1 %1f
        _column(4) byte b4 %1f
        _column(3) byte b3 %1f
}
The only difference between these two approaches is the order in which the variables are stored in Stata.
. clear

. quietly infile using test1a

. list

     +-------------------+
     | b1   b2   b3   b4 |
     |-------------------|
  1. |  1    1    0    1 |
  2. |  0    1    1    1 |
  3. |  1    1    0    0 |
     +-------------------+

. clear

. quietly infile using test1b

. list

     +-------------------+
     | b2   b1   b4   b3 |
     |-------------------|
  1. |  1    1    1    0 |
  2. |  1    0    1    1 |
  3. |  1    1    0    0 |
     +-------------------+
This example also shows that you can access the columns of the text file in any order.
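Relatedly, if all you want is a different variable order in memory, a second dictionary is not required: you can read the file with test1a.dct and rearrange the variables afterward. A minimal sketch using Stata's order command:

. clear
. quietly infile using test1a
. order b2 b1 b4 b3

Because these are the only four variables in the dataset, the result is the same variable order that test1b.dct produces.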
Now let us say that you want to read in the data from the file test2.raw:
C1245A101George Costanza
B1223B011Cosmo Kramer
In this file, we have documentation from the person who supplied the data telling us that

1. columns 1-5 contain a string code,
2. columns 2-5 contain a numeric call value,
3. column 6 contains a one-character city code,
4. columns 7-9 contain a numeric neighborhood code, and
5. columns 10-25 contain the person's name.
So, we can prepare a dictionary like this:
dictionary using test2.raw {
        _column(1)  str5  code  %5s
        _column(2)  int   call  %4f
        _column(6)  str1  city  %1s
        _column(7)  int   neigh %3f
        _column(10) str16 name  %16s
}
This example shows that you can reread columns, placing their contents into different variables. Although the data look much more complicated in this example, our approach of always giving the four properties makes the dictionary easy to read and easy to match against the documentation that came with the data.
. clear

. quietly infile using test2

. list

     +-----------------------------------------------+
     |  code   call   city   neigh              name |
     |-----------------------------------------------|
  1. | C1245   1245      A     101   George Costanza |
  2. | B1223   1223      B      11      Cosmo Kramer |
     +-----------------------------------------------+
Here we introduce records that span more than one line in the text file. The only additional responsibility we have when making the dictionary is to specify the point at which the record continues on a new line. Consider the data in test3.raw:
Jonathan Swift
12345 South Mockingbird
Detroit, Michigan
1010111
e e cummings
4123 Elm
Buffalo, New York
1101210
The data documentation that accompanied this file tells us that

1. the first line of each record contains a name in columns 1-15,
2. the second line contains an address in columns 1-30,
3. the third line contains a city and state in columns 1-20, and
4. the fourth line contains the answers to 7 yes/no questions, coded 1=yes and 0=no, in columns 1-7.
For these data, we can prepare the dictionary:
dictionary using test3.raw {
        _column(1) str15 name   %15s _newline
        _column(1) str30 addr   %30s _newline
        _column(1) str20 city   %20s _newline
        _column(1) byte  yesno1 %1f
        _column(2) byte  yesno2 %1f
        _column(3) byte  yesno3 %1f
        _column(4) byte  yesno4 %1f
        _column(5) byte  yesno5 %1f
        _column(6) byte  yesno6 %1f
        _column(7) byte  yesno7 %1f
}
After a _newline, column counting starts over, so _column(1) refers to the first column of the new line. Here is the result:
. clear

. quietly infile using test3.dct

. list

     +-----------------------------------------------------------------------+
  1. |           name |                    addr |              city | yesno1 |
     | Jonathan Swift | 12345 South Mockingbird | Detroit, Michigan |      1 |
     |-----------------------------------------------------------------------|
     | yesno2 | yesno3 | yesno4 | yesno5 | yesno6 | yesno7                    |
     |      0 |      1 |      0 |      1 |      1 |      1                    |
     +-----------------------------------------------------------------------+

     +-----------------------------------------------------------------------+
  2. |           name |                    addr |              city | yesno1 |
     |   e e cummings |                4123 Elm | Buffalo, New York |      1 |
     |-----------------------------------------------------------------------|
     | yesno2 | yesno3 | yesno4 | yesno5 | yesno6 | yesno7                    |
     |      1 |      0 |      1 |      2 |      1 |      0                    |
     +-----------------------------------------------------------------------+
Another piece of advice for reading large text files: use the in qualifier to limit infile to reading just one observation. This restriction lets you test your dictionary and see whether it is working properly.
. infile using test1a in 1

dictionary using test1.raw {
        _column(1) byte b1 %1f
        _column(2) byte b2 %1f
        _column(3) byte b3 %1f
        _column(4) byte b4 %1f
}
(1 observations read)

. list

     +-------------------+
     | b1   b2   b3   b4 |
     |-------------------|
  1. |  1    1    0    1 |
     +-------------------+
Since that looks OK, we might continue by reading the entire dataset, or we might first read, say, the first five observations (in 1/5) to test the dictionary further. This brings up an important point: if you get documentation with the data that you are trying to read into Stata, you should always use the assert command to check that the data follow the description set out in the documentation. For instance, in the previous example, the documentation said that there were 7 yes/no questions coded as 1=yes and 0=no. After reading in the data, you should check that each of these variables really contains only 0s and 1s:
. assert yesno1==0 | yesno1==1

. assert yesno2==0 | yesno2==1

. assert yesno3==0 | yesno3==1

. assert yesno4==0 | yesno4==1

. assert yesno5==0 | yesno5==1
1 contradiction in 2 observations
assertion is false
r(9);

. assert yesno6==0 | yesno6==1

. assert yesno7==0 | yesno7==1
As you can see, one of the assertions is false. That might mean that our dictionary is wrong, or it might mean that the documentation that came with the data is wrong. Either way, we should note that the discrepancy exists and question the data provider about it.
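Before going back to the data provider, it helps to see exactly which observations violate the documented coding. A minimal sketch, using the variable names from the example above, that lists the offending records for yesno5:

. list name yesno5 if !inlist(yesno5, 0, 1)

Rather than typing seven separate assert commands, the same checks can also be written as a short forvalues loop (a sketch; like assert itself, it stops at the first check that fails):

forvalues i = 1/7 {
        // each yes/no variable should contain only 0 or 1
        assert inlist(yesno`i', 0, 1)
}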