Title | Calculating the number of distinct values |
Author | Nicholas J. Cox, Durham University, UK |
I have data collected in sequence like this:
. list

     +------+
     |    x |
     |------|
  1. |  cd1 |
  2. |  cd2 |
  3. |  cd2 |
  4. |  cd3 |
  5. |  cd1 |
     |------|
  6. |  cd3 |
  7. |  cd4 |
  8. |  cd1 |
  9. |  cd5 |
 10. |  cd3 |
     +------+
I want to keep track of the number of distinct values seen so far in the sequence. This number increases from 1 at observation 1 (cd1 first occurs), to 2 at observation 2 (cd2 first occurs), to 3 at observation 4 (cd3 first occurs), and so forth.
You can do the above by using by:, which is one of the most versatile features of Stata.
One clue that by: will be useful here is the structure of the problem: the values of x fall into groups, one for each distinct value. All we need to do is tag the first occurrence of each distinct value, and then count those first occurrences in sequence.
by: goes hand in hand with sorting. We should keep a record of the current order of observations, because we will want to return to it. If the dataset already includes a time or other identifier indicating sequence, we can use that. Otherwise, generate a variable recording the current order:
. generate order = _n
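If your dataset is really big (more than 16,777,216 observations, beyond which the default float storage type can no longer hold every observation number exactly), that should be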
. generate long order = _n
We will sort into groups of x and ensure that within those groups the original order of observations is followed. Then we tag the first occurrence of each value of x. This process can all be telescoped into one statement:
. by x (order), sort: generate y = _n == 1
That statement can be thought of as a condensed version of
. sort x order
. by x: gen y = _n == 1
The sort order is first by x and then by order. Then within groups of x, the first observation is tagged as 1; all others within the same group are tagged as 0.
Let us take this more slowly: Under by:, the observation number _n is determined within the groups defined. Thus _n starts over at 1 each time a new group is encountered. So _n is 1 if an observation is the first in its group, and _n == 1 is true for all such first observations and false for all others. Any true or false condition is evaluated numerically in Stata as 1 if true and 0 if false. For more detail on that principle, see the FAQ: What is true and false in Stata?
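Both principles can be checked on a toy dataset. The following is only a sketch: the variable names g, within, and first are made up for illustration and are not part of the example above.

```stata
* True-or-false conditions evaluate to 1 or 0
display 2 == 2
display 2 == 3

* A small illustrative dataset with two groups
clear
input str1 g
"a"
"a"
"b"
"b"
"b"
end

* Under by:, _n restarts at 1 within each group of g
by g, sort: generate within = _n

* So _n == 1 is 1 for the first observation in each group, 0 otherwise
by g: generate first = _n == 1

list, sepby(g)
```

Here within should run 1, 2 in group a and 1, 2, 3 in group b, and first should be 1 only where within is 1.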
After that, we need to sort back to the original order. Then we need a running sum of y because the number of distinct values seen so far is equal to the number of first occurrences seen so far.
. sort order
. replace y = sum(y)
order has served its purpose.
. drop order
What do we have now?
. list

     +----------+
     |   x    y |
     |----------|
  1. | cd1    1 |
  2. | cd2    2 |
  3. | cd2    2 |
  4. | cd3    3 |
  5. | cd1    3 |
     |----------|
  6. | cd3    3 |
  7. | cd4    4 |
  8. | cd1    4 |
  9. | cd5    5 |
 10. | cd3    5 |
     +----------+
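For readers who want to reproduce the result, the steps can be collected into one do-file sketch. The input block simply retypes the ten example values; everything after it repeats the commands already shown, using long for order as a cautious default.

```stata
* Recreate the example data
clear
input str3 x
"cd1"
"cd2"
"cd2"
"cd3"
"cd1"
"cd3"
"cd4"
"cd1"
"cd5"
"cd3"
end

* Record the original order of observations
generate long order = _n

* Tag the first occurrence of each distinct value of x
by x (order), sort: generate y = _n == 1

* Return to the original order and count first occurrences so far
sort order
replace y = sum(y)

* order has served its purpose
drop order
list
```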
With a little more knowledge, we could wrap that into a command, or an egen function, but, in many ways, it is better to use the code here and understand its logic, which will help with the next problem of a similar flavor.
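For the curious, a wrapper might look something like the following. This is a minimal sketch, not an official or published command: the name distinctsofar and its generate() option are hypothetical, and it makes no special provision for if or in qualifiers or for missing values (a missing value of the variable would simply be counted as one more distinct value).

```stata
* Hypothetical wrapper: running count of distinct values seen so far
program define distinctsofar
    version 12
    syntax varname, Generate(name)

    * Refuse to overwrite an existing variable
    confirm new variable `generate'

    * Temporary variable holding the original order of observations
    tempvar order

    quietly {
        generate long `order' = _n
        * Tag the first occurrence of each distinct value
        bysort `varlist' (`order'): generate long `generate' = _n == 1
        * Restore the original order, then take the running sum
        sort `order'
        replace `generate' = sum(`generate')
    }
end
```

Under these assumptions, typing distinctsofar x, generate(y) would create y as in the listing above and leave the data in their original order.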
The key construct here is by:. The documentation for by: is scattered around the manuals. A tutorial bringing together the main ideas is given in Cox (2002), which explains the use of the construct to tackle a variety of problems with group structure, ranging from simple calculations for each of several groups to more advanced manipulations that use the built-in _n and _N.