Title | Stata 6: Generating variables that contain repeating sequences of numbers |
Author | David Reichel, StataCorp |
Sometimes, it is valuable to generate a variable that contains a sequence of numbers in a particular pattern. Such a variable could be used as part of a match-merge procedure to give a certain shape or structure to the resulting dataset. For example, it may be useful to create a variable that contains observation identifiers or an automatic numbering of levels of factors or categorical variables.
The fill() function of the egen command is remarkably useful for this purpose. To create a variable that repeats the pattern
10 10 12 12 20
you could write the following commands:
set obs 1000
egen seq = fill(10 10 12 12 20 10 10 12 12 20)
This would create a variable seq with 1000 observations in which the five-number sequence is repeated 200 times.
Please note: A somewhat complicated pattern such as this one must be typed twice inside the parentheses to inform Stata of the exact pattern desired.
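One quick way to confirm that fill() extended the pattern as intended is to count how often each value occurs. The sketch below simply reruns the commands above and tabulates the result; the expected counts follow from the pattern itself (10 and 12 each appear twice in every five observations, 20 once).

```stata
* Rebuild the example and check how often each value occurs
clear
set obs 1000
egen seq = fill(10 10 12 12 20 10 10 12 12 20)

* Show the first two repetitions of the pattern
list seq in 1/10

* With 200 repetitions, expect 400 tens, 400 twelves, and 200 twenties
tabulate seq
```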
Two commands developed by N. J. Cox are also useful. The first is the seq command (Stata Technical Bulletin 37, dm44), which can be downloaded for free (type help net for details). seq creates a new variable that contains a sequence of integers such as
1 2 3 1 2 3 1 2 3
or
1 1 1 2 2 2 3 3 3
You can specify the beginning number (f), the ending number (t), and how many times each number is repeated (b). For example, the two sequences above can be generated by the commands
seq a, f(1) t(3)
and
seq b, f(1) t(3) b(3)
This command can use initial integers other than 1 and can produce decreasing sequences. It also supports by, if, and in.
A similar function can be found in Stata Technical Bulletin 50, dm70, “Extensions to generate, extended”, by N. J. Cox. The syntax is different: it is a function of the egen command, also named seq, and it is called with parentheses, with the options following a comma. For example,
egen d = seq(), f(10) t(12)
generates the sequence:
10 11 12 10 11 12 10 11 12
Although slightly more complex to use, this function is designed to give more consistent results with datasets that require sorting.
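As a sketch of how the options combine, the following assumes that the seq() function of egen is available in your installation and that it accepts the same f(), t(), and b() options described above for the seq command. Check help egen for the exact option names in your version.

```stata
* Assumes egen's seq() function is available and accepts f(), t(), and b()
clear
set obs 12

* Each value from 10 to 12 is repeated twice before the block starts over
egen e = seq(), f(10) t(12) b(2)

* Expected pattern: 10 10 11 11 12 12 10 10 11 11 12 12
list e
```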
There are two additional functions to consider. They are both associated with the generate command.
To generate a sequence of numbers like
1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
the mod(x,y) function can be used. This function returns the remainder when x is divided by y. Different sequences of consecutive numbers can be generated by using an expression that includes _n for x and by setting y equal to the total number of observations within each repeated pattern. _n is called an “underscore variable”. It is a built-in system variable that contains the number of the current observation.
For example, to generate the repeated sequence above, type
gen seq2 = mod(_n-1,6) + 1
It is valuable to experiment with the mod() function to see what results can be obtained. For example, try using _n instead of _n-1 in the formula, and try removing the + 1 at the end of the expression. To increment by two instead of by one, multiply the entire right side of the expression by 2, which is the same as doubling the mod() term and adding 2:
gen seq3 = 2*mod(_n-1,6) + 2
This will generate the following sequence:
2 4 6 8 10 12 2 4 6 8 10 12 2 4 6 8 10 12
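The experiments suggested above can be laid side by side on a small dataset. This is only an illustrative sketch; the variable names m1 through m4 are arbitrary.

```stata
* Compare several mod() variants on 12 observations
clear
set obs 12

* 1 2 3 4 5 0 1 2 3 4 5 0 : using _n directly puts a 0 at the end of each cycle
gen m1 = mod(_n, 6)

* 0 1 2 3 4 5 0 1 2 3 4 5 : subtracting 1 from _n makes each cycle start at 0
gen m2 = mod(_n-1, 6)

* 1 2 3 4 5 6 1 2 3 4 5 6 : adding 1 shifts the cycle to run from 1 to 6
gen m3 = mod(_n-1, 6) + 1

* 2 4 6 8 10 12 2 4 6 8 10 12 : multiplying the whole expression by 2 doubles the step
gen m4 = 2*(mod(_n-1, 6) + 1)

list m1 m2 m3 m4
```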
The fill() function might be easier to use for simple sequences such as these. If the sequence involves consecutive integers, the seq() function can handle long repeating patterns, which would be tedious to type out using the fill() function. However, if you wanted to generate non-consecutive numbers (like the above example) ranging from one to one thousand and repeat that pattern many times, using the mod(x,y) function would save typing.
To repeat each number a specific number of times, specify the block size with the b() option of the seq() function (as discussed above), or use the int(x) function. This function returns the integer obtained by truncating x. Thus, int(5.2) is 5. If you want the following repeated pattern
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9
the command is
gen seq = int((_n-1)/2) + 1
Again, it is valuable to experiment with using _n instead of _n-1 and also eliminating the + 1 at the end. You can also multiply the right side of the equation by any constant to make the sequence increment by larger or smaller steps between groups of numbers. Dividing by a number other than 2 can change the length of each repeated group.
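The variations described above can be compared in the same way. This is an illustrative sketch with arbitrary variable names (i1 through i4).

```stata
* Compare several int() variants on 12 observations
clear
set obs 12

* 0 1 1 2 2 3 3 4 4 5 5 6 : without the -1, the first group has only one member
gen i1 = int(_n/2)

* 0 0 1 1 2 2 3 3 4 4 5 5 : groups of two, starting at 0
gen i2 = int((_n-1)/2)

* 1 1 1 2 2 2 3 3 3 4 4 4 : dividing by 3 gives groups of three
gen i3 = int((_n-1)/3) + 1

* 10 10 20 20 30 30 40 40 50 50 60 60 : multiplying by 10 widens the step
gen i4 = 10*(int((_n-1)/2) + 1)

list i1 i2 i3 i4
```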
What if you need a variable that repeats values within a sequence and repeats the sequence itself? For example,
1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3
For this sequence, the fill() or seq() functions would still be the easiest to use, but I will demonstrate an alternative procedure.
The mod(x,y) function and the int(x) function can be used together. The mod(x,y) function helped to create a sequence that incremented by a given amount and was repeated. The int() function allowed us to repeat values within that sequence. To create the above sequence, type
gen seq = int((mod(_n-1,6))/2) + 1
Notice you can change the length of the sequence by changing the number 6, and you can change the number of times that each value repeats by changing the number 2.
To generate a sequence like
1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
you can change the 6 to a 9 and the 2 to a 3:
gen seq = int((mod(_n-1,9))/3) + 1
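The two numbers in this formula are related: the cycle length is the number of distinct values multiplied by the number of times each value repeats. A minimal sketch of that relationship, using local macros whose names (nvals, reps, cycle) and the variable name pattern are chosen here purely for illustration:

```stata
* Illustrative names: nvals = number of distinct values, reps = copies of each value
clear
set obs 18
local nvals 3
local reps 3
local cycle = `nvals'*`reps'

* mod() restarts the count every `cycle' observations;
* int() then splits each cycle into blocks of `reps' equal values
gen pattern = int(mod(_n-1, `cycle')/`reps') + 1

* With nvals 3 and reps 3: 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
list pattern
```

Setting reps to 2 instead gives a cycle length of 6 and reproduces the earlier 1 1 2 2 3 3 pattern.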
There are other useful commands for special circumstances. The group() function of the egen command is described in a FAQ written by N. J. Cox and W. Gould entitled "How do I create individual identifiers numbered from 1 upwards?"