Title | Reading a hierarchal dataset with infile | |
Author | William Gould, StataCorp |
In hierarchical datasets, the records do not all have the same format, nor do they contain the same type of information. In computer jargon, the file is organized so that the number and types of fields depend on the record types. For example, you may have a dataset that has records for families and records for persons within family.
To read these types of datasets, you need to create a dictionary for each type of record. Then read in all the data by using each of the dictionaries and keep only the appropriate data. Once you have these two Stata datasets, you can merge them. This is a difficult data management problem, but it is manageable if you break the problem into several steps. We cover this topic in an extended example in lecture 1 of NetCourse 151. Below is an excerpt from that lecture.
Begin quotation from NetCourse 151
Now let’s consider reading a hierarchical dataset with Stata. This is tricky.
Suppose that you have data on families and persons within families. The data have the format family record followed by one or more person records:
family record
person record
...
person record
family record
etc...
Let’s assume that
family record: | person record: | ||
---|---|---|---|
col. 1–5 | family id | col. 1–5 | person id |
col. 7 | "1" | col. 7 | "2" |
col. 9 | dwelling type code | col. 8–9 | age |
col. 11 | sex code |
Note: Sample data are at the end of this lecture, after the exercises. You should use those data to test your ability to read this kind of dataset.
The data probably contain more information than this, but this is enough for illustration. If column 7 contains a 1, it is a family record, and if it contains a 2, it is a person record. This is called the record-type indicator.
I want to create a Stata .dta set containing
My dataset will contain one observation per person, and the family information will be repeated for persons in the same family.
First, I will create separate dictionaries for reading the family and person information:
DICTIONARY: family.dct |
---|
dictionary using hier.raw { long famid %5f "family id" _column(7) byte rectype %1f "record type" _column(9) byte dwell %1f "dwelling code" } |
DICTIONARY: person.dct |
---|
dictionary using hier.raw { long perid %5f "person id" _column(7) byte rectype %1f "record type" _column(8) byte age %2f "age (years)" _column(11) byte sex %1f "sex code" } |
I will then test each one of these dictionaries, to make sure they work:
. clear . infile using family if rectype==1 in 1/100 . list in 1/5 . type hier.raw <- I'll press Break to stop this . clear . infile using person if rectype==2 in 1/100 . list in 1/5 . type hier.raw <- I'll press Break to stop this
What I’m doing above is reading a few data, typing the original, and comparing them.
Satisfied that I have good dictionaries, I then create a do-file to read the entire dataset. The basic plan of my do-file is to
This problem would be easy if the person records contained the family id to which they belonged—step 1 would be an infile ... if rectype==1 followed by a sort and save, and step 2 would be an infile ... if rectype==2 followed by a sort, and step 3 would be a merge. My whole do-file would be
STEP 1 clear infile using family if rectype==1 sort famid save tmph, replace STEP 2 clear infile using person if rectype==2 sort famid STEP 3 merge m:1 famid using tmph
In my example, however, famid does not appear on the person records (just as it does not in data released by the U.S. Bureau of the Census and the U.S. Bureau of Labor Statistics).
This adds significantly to the complication. I’m going to read the family records as I did above, but in addition, I’m going to manufacture my own family ID variable, labeling the first family 1, the second 2, and so on:
MODIFIED STEP 1 clear infile using family if rectype==1 gen long id = _n sort id save tmph, replace
Next, when I read the person data, I am going to read the family records as if they were person records, too. The result will be that I have a placeholder observation for the family record:
perid | rectype | age | sex | |
1. | junk | 1 | junk | junk |
2. | 1 | 2 | 32 | 0 |
3. | 2 | 2 | 30 | 1 |
4. | junk | 1 | junk | junk |
5. | 1 | 2 | 40 | 1 |
etc. |
I will then regenerate my temporary family ID variable and discard the misread family records.
To regenerate the id variable, I will first gen id = 1 if rectype == 1,
perid | rectype | age | sex | id | |
1. | junk | 1 | junk | junk | 1 |
2. | 1 | 2 | 32 | 0 | . |
3. | 2 | 2 | 30 | 1 | . |
4. | junk | 1 | junk | junk | 1 |
5. | 1 | 2 | 40 | 1 | . |
etc. |
and then replace id = sum(id),
perid | rectype | age | sex | id | |
1. | junk | 1 | junk | junk | 1 |
2. | 1 | 2 | 32 | 0 | 1 |
3. | 2 | 2 | 30 | 1 | 1 |
4. | junk | 1 | junk | junk | 2 |
5. | 1 | 2 | 40 | 1 | 2 |
etc. |
and finally, drop if rectype == 1
perid | rectype | age | sex | id | |
1. | 1 | 2 | 32 | 0 | 1 |
2. | 2 | 2 | 30 | 1 | 1 |
3. | 1 | 2 | 40 | 1 | 2 |
I will then be able to merge my family data with my person data. So, the outline of my do-file is
MODIFIED STEP 1 clear infile using family if rectype==1 gen long id = _n sort id save tmph, replace MODIFIED STEP 2 clear infile using person /* no if! */ gen long id = 1 if rectype==1 replace id = sum(id) sort id MODIFIED STEP 3 merge m:1 id using tmph drop id
Some final details:
My outline assumes that rectype takes on the values 1 and 2 as the documentation claims and that everything merges. I need to include some checks. Thus my final do-file is
DO-FILE: crhier.do |
---|
capture log close log using crhier, replace clear infile using family if rectype==1 drop rectype gen long id = _n /* make my own temporary id var */ sort id /* to set sort markers */ save tmph, replace clear infile using person /* no matter what the rectype */ assert rectype==1 | rectype==2 /* just to be safe -- see note */ gen long id = 1 if rectype==1 replace id = sum(id) drop if rectype==1 drop rectype sort id perid merge m:1 id using tmph assert _merge==3 /* they are supposed to match */ drop _merge id sort famid perid save hier, replace erase tmph.dta log close exit |
Let me direct your attention to the line that reads
assert rectype==1 | rectype==2
This line is an important part of my do-file. Everything I’m doing hinges on the documentation being correct; that is, rectype really does take on the values 1 and 2, and only the values 1 and 2, and that I am correctly reading rectype (I’m reading the right columns of the data). In my do-file, if rectype ever takes on a value other than 1 or 2, things stop right there.
Similarly, after merging, I include the line
assert _merge==3
Theoretically, this line must be true, but in reality, I sometimes make mistakes. Asserting things that must be true is a good way to catch bugs.
Early on, I should verify more about the data. For instance, the documentation implies that there are no empty households, meaning two household records in a row. I could prove this by including the line
assert rectype!=1 if rectype[_n-1]==1
The final thing you need to know is that Stata’s infix command and infile with a data dictionary are really the same command—both read the data in record mode. I have chosen to use infile in the example above, but I could have used infix. My dictionaries would have had a slightly different format, but the logic would be the same. All that would change in my do-file would be the switch from infile to infix.