Title | Sorting on categorical variables |
Author | William Gould, StataCorp |
There is really no general answer to this question other than that your program has an error in it. There is, however, one common error that even experienced Stata users make:
If you sort on a variable that does not have unique values for every observation in the data and subsequently refer, implicitly or explicitly, to the order within group (say, by using _n or explicit subscripts such as x[1] or x[_N] with by), the results can differ each time you run the file.
Consider the following dataset:
    . sort group
    . list

         +-----------------+
         | group   x1   x2 |
         |-----------------|
      1. |     1    5    7 |
      2. |     1    2    6 |
      3. |     1    3    9 |
      4. |     2    1    2 |
      5. |     2    7    4 |
         +-----------------+
The first value of x1 in the first group is 5. Now let us jumble up these data (we will sort on x1) and then sort the data again by group:
    . sort x1
    . sort group
    . list

         +-----------------+
         | group   x1   x2 |
         |-----------------|
      1. |     1    3    9 |
      2. |     1    5    7 |
      3. |     1    2    6 |
      4. |     2    1    2 |
      5. |     2    7    4 |
         +-----------------+
Before, the first value of x1 in the first group was 5; now it is 3. Why the change? Because group takes on repeated values across observations, and we said sort group without saying how the data should be sorted within group. Since we did not specify, Stata chose an order at random.
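As a minimal sketch of the remedy, the five observations above can be re-entered and sorted with a second key that breaks the ties. It assumes x1 is the order wanted within group, which the text does not state; any variable that is unique within group would serve.

```stata
* Sketch: rebuild the five-observation example from the listing above
clear
input group x1 x2
1 5 7
1 2 6
1 3 9
2 1 2
2 7 4
end

* x1 is unique within group in these data, so the order is fully determined
sort group x1
list

* isid exits with an error unless group and x1 uniquely identify observations
isid group x1
```

Because no two observations share the same values of group and x1, repeating these commands should give the same listing every time.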
People have sent us do-files that contain
    ...
    sort patid
    quietly by patid: keep if age[1]>20
    ...
The intent of these lines was to select patients who were older than 20, but that is not what the user got; moreover, the user got a different sample every time the do-file was run. The problem was that patid took on repeated values, so sort patid alone did not specify the order within patid. The user meant to code
    ...
    sort patid age
    quietly by patid: keep if age[1]>20
    ...
or
    ...
    sort patid time
    quietly by patid: keep if age[1]>20
    ...
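The same selection can be sketched with bysort, which combines the sort and the by prefix so that the within-patient order is written on the same line that depends on it. This assumes the variables patid, age, and time from the text; the variable in parentheses is used for sorting but not for forming groups.

```stata
* Sketch, assuming patid, age, and time exist as in the text.
* Sort by patid and time, form groups by patid only, and keep
* patients whose age at their first time point exceeds 20.
bysort patid (time): keep if age[1] > 20
```

If time can repeat within a patient and age differs across the tied records, age[1] is still not determined, so the sort keys need to identify a single first record.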
Now suppose that, rather than keeping all records for the selected patients, we wanted to keep just the first record for each patient.
    ...
    sort patid age
    quietly by patid: keep if _n==1
    ...
Sorting on both patid and age might not be sufficient because each patient might have multiple records with the same age. We would be selecting one record at random from the earliest records for each patient. If our data included variable time and time was unique within patient,
    ...
    sort patid time
    quietly by patid: keep if _n==1
    ...
would be better.
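Whether the sort keys really are unique can be checked before relying on the order. The following is a sketch that assumes the variables patid, age, and time from the text; it uses duplicates report to count ties and isid to stop the do-file if the keys do not identify each record.

```stata
* Sketch, assuming patid, age, and time exist as in the text.

* Report how many records share the same patid and age
duplicates report patid age

* Exit with an error unless patid and time uniquely identify each record
isid patid time

* Order is now determined: keep the earliest record for each patient
bysort patid (time): keep if _n == 1
```

Placing isid ahead of the selection means the do-file fails loudly when the data contain ties, rather than silently returning a different sample on each run.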
In other words, be careful. There is nothing wrong with sorting on categorical variables by themselves (sort patid and sort group are fine); just do not assume that the order of observations within the grouping variable is determined. Be especially careful when selecting observations within groups.