Title | Expanding datasets to all possible pairs | |
Author | Nicholas J. Cox, Durham University, UK |
I have a dataset containing a group variable, an individual identifier variable, and various descriptive variables. I would like to produce a new dataset based on all possible pairs of identifiers within groups; that is, the observations in a pair must be from the same group.
Let us phrase this in terms of a group variable group, an individual identifier variable id, and descriptive variables a and b. We will assume that id is unique, at least within group.
Here is an example with concocted data:
. list +----------------------+ | group id a b | |----------------------| 1. | 1 9 23 98 | 2. | 1 11 62 78 | 3. | 2 8 8 65 | 4. | 2 12 56 39 | 5. | 2 15 72 1 | |----------------------| 6. | 2 25 26 88 | 7. | 2 35 22 66 | 8. | 3 9 48 21 | 9. | 3 15 63 64 | 10. | 3 16 19 37 | |----------------------| 11. | 3 22 88 87 | 12. | 3 27 71 60 | 13. | 3 32 37 64 | 14. | 4 4 38 18 | 15. | 4 20 28 11 | |----------------------| 16. | 5 14 84 65 | +----------------------+
Consider each group i, in which there are ni observations, so that each observation can in turn be paired with each observation within the group, yielding ni * ni pairs. We will return later to consider what to do if the ni pairs in which each observation is paired with itself are not wanted, as that is easy to fix. We will also return to removing duplicate pairs if observation j with observation k and observation k with observation j are not both wanted.
We need first to expand the dataset, so that each observation is replaced by ni copies of itself. The Stata command for this is expand; see [D] expand. To obtain the group frequencies ni, we type
. sort group . by group: gen gfreq = _N
In Stata 7 and later, this can be expressed as
. by group, sort: gen gfreq = _N
after which we can type
. expand gfreq
Now the trick needed is to work on this expanded dataset so that pairs are identified properly. Focus on group 1 in our original data:
group id a b 1. 1 9 23 98 2. 1 11 62 78
In this group, we have ids of 9 and 11, so the pairs are 9 and 9, 9 and 11, 11 and 9, and 11 and 11. We need to generate the paired id, which we will call id2. The first step is to re-sort the data, as expand just adds extra observations at the end of the dataset. Because of this, they are no longer sorted by group, and we need to break ties by sorting on id:
. sort group id
After expansion and sorting, this group is now
. list if group == 1 +------------------------------+ | group id a b gfreq | |------------------------------| 1. | 1 9 23 98 2 | 2. | 1 9 23 98 2 | 3. | 1 11 62 78 2 | 4. | 1 11 62 78 2 | +------------------------------+
Again, in Stata 7 and later, we use
. by group id: gen numid2 = _n
Here we are numbering the copies of each original observation. Focus on the results of that for group 1:
. list if group == 1 +---------------------------------------+ | group id a b gfreq numid2 | |---------------------------------------| 1. | 1 9 23 98 2 1 | 2. | 1 9 23 98 2 2 | 3. | 1 11 62 78 2 1 | 4. | 1 11 62 78 2 2 | +---------------------------------------+
For id2, one acceptable solution is values of 9, 11, 9, 11; one way of getting that is by selecting the 2nd, 4th, 2nd, 4th values of id within this group. This method is mentioned because it generalizes easily to selecting observations subscripted by values of gfreq * numid2 within each group.
Let us go through that again. Here's the Stata command we need:
. by group: gen id2 = id[gfreq * numid2]
For group 1, gfreq * numid2 takes on values in turn of 2 * 12, 2 * 24, 2 * 12, and 2 * 24. We could put those values in a new variable, but that is not necessary because Stata can handle expressions within subscripts. Because generate is being executed under by group:, subscripts are interpreted within each distinct value of group. Hence id[2] in this instance is interpreted successively as the second value of id, within each of groups 1 to 5. See [U] 13.7 Explicit subscripting for more explanation.
Having done that, it is easy to see how the same idea can be applied to each descriptive variable. Let us suppose that a2 and b2 are not in use as variable names. Then we can do the remaining assignments with the following commands:
. by group: gen a2 = a[gfreq * numid2] . by group: gen b2 = b[gfreq * numid2]
Two detailed points remain. Suppose that we are not interested in pairs for which id and id2 are identical:
. drop if id == id2
Suppose that we do not want both (j, k) and (k,j):
. drop if id > id2
(or use <).
A note of caution: expansion can result in much larger datasets. You may need to worry about memory. For example, if each group frequency is 10, then each group expands to 45, 90, or 100, depending on the precise definition of pair, and much larger expansion factors are clearly possible.