Title: Creating variables recording properties of the other members of a group
Author: Nicholas J. Cox, Durham University, UK
Suppose you have data on families. For each person in each family, it may be useful to calculate variables that summarize properties of the other members of the same family. How many other children are there? What is their average, maximum, or minimum age? Is there an older child or a younger child? The more general problem can be described as summarizing properties, for each individual, of the other members of the same group.
Let us look at some invented data. For what follows, it is essential to have a group identifier, so, in this example, we have an identifier for each family. It is not always essential to have an individual identifier, but what follows does depend on each person occurring just once in the dataset. In practice, however, such data usually include individual identifiers.
         family   person   sex   age
      1.      1        1     1    36
      2.      1        2     1    16
      3.      1        3     1    14
      4.      2        1     0    45
      5.      2        2     1    42
      6.      2        3     0    14
      7.      2        4     1    12
      8.      2        5     0    10
      9.      3        1     0    39
     10.      3        2     1    36
     11.      3        3     0    11
     12.      3        4     1     9
     13.      3        5     1     7
     14.      3        6     1     3
We will suppose that sex is recorded as 1 for female and 0 for male. Such 0–1 coding is in a sense arbitrary, but it makes life easier, especially for statistical modeling in which the response is a binary variable and (more directly important here) for counting values within each group.
Let us define children as those whose age is 17 and under. For each child, how many other children are there? This is simply the number of children in the family, minus 1 if the person in question is a child. (In family 3, with 4 children, for each child there are 3 other children.)
For any calculation like this, it is always worth looking to see whether egen provides an answer to at least part of the problem. Many functions have been written for egen. In particular, egen, total() by() is natural for producing totals, including counts, separately for groups defined by one or more variables specified as arguments to by(). egen, count() by() is also often useful but is a little less general in application, so we will concentrate here on total(). total() in Stata 9 and later releases is a replacement for sum() in Stata 8.
    . egen nchild = total(age <= 17), by(family)
    . replace nchild = nchild - (age <= 17)

age <= 17 will be true (evaluates to 1) whenever age is less than or equal to 17, and false (evaluates to 0) otherwise. Adding up the 1s and 0s within egen, total() is the same as counting the observations for which age <= 17. We then subtract age <= 17 from each observation. The effect of the by(family) option is to count within families, each family being a group of observations with the same value of family. The effect of the replace correction is confined to individual observations.
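As a check on the two commands above, the following sketch enters the example data by hand and verifies the counts for family 3. The assert lines are an addition for illustration only; they stop with an error if a stated result does not hold.

```stata
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
3 1 0 39
3 2 1 36
3 3 0 11
3 4 1 9
3 5 1 7
3 6 1 3
end

* count the children in each family, then remove each child's own contribution
egen nchild = total(age <= 17), by(family)
replace nchild = nchild - (age <= 17)

* each child in family 3 has 3 other children; each adult there sees all 4
assert nchild == 3 if family == 3 & age <= 17
assert nchild == 4 if family == 3 & age > 17

list family person age nchild, sepby(family)
```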
The syntax for egen indicates that total() works on an expression exp. The argument need not be a single variable but can usefully be something more complicated. Being interested only in other female children is not any more difficult:
    . egen nsisters = total(age <= 17 & sex == 1), by(family)
    . replace nsisters = nsisters - (age <= 17 & sex == 1)
This solution also assigns values to adults, those with age greater than or equal to 18. This could be useful, or not useful, depending on your substantive problem. If you wanted to exclude adults completely from the calculation, you could specify if age <= 17 on the egen command, and values for adults would then be missing (.).
If we wanted to count not “other children” but “other adults”, we should be a little more careful. The expression age >= 18 includes missing values for age, as in Stata missing counts higher than any other numeric value. Often we will want to exclude those with the condition age >= 18 & age < . unless we know we can treat missing ages as adults.
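To see the effect of the missing-value guard, here is a minimal sketch on a hypothetical three-person family in which one age is missing. The data and the variable names naive and nadult are invented for this illustration and are not part of the example dataset.

```stata
clear
* hypothetical family: one adult, one person of unknown age, one child
input family person age
1 1 40
1 2 .
1 3 10
end

* unguarded count: the missing age satisfies age >= 18 and is counted as an adult
egen naive = total(age >= 18), by(family)

* guarded count of the other adults, excluding missing ages
egen nadult = total(age >= 18 & age < .), by(family)
replace nadult = nadult - (age >= 18 & age < .)

assert naive == 2                  // one real adult plus the missing age
assert nadult == 0 if person == 1  // the adult has no other adults
assert nadult == 1 if person != 1  // everyone else sees exactly one adult

list
```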
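Other totals, and by extension means, can be calculated using the same general approach. Put simply, the total for the others is the total for everyone minus the value for this observation, and the mean for the others is that total divided by the number of others.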
What is the average age of the other children in each family? Here is one solution:
    . egen totalage = total(age) if age <= 17, by(family)
    . replace totalage = totalage - age
    . generate meanage = totalage/nchild
This solution excludes the adults. Not only are they not included in the summation of age, but they also receive missing values for the result. In the replace command, we can be cavalier about excluding or including the adults; either way, the missing values will not be changed.
If we want to include the adults—that is, we want a record for each adult of the average age of the children—here is a solution:
    . egen totalage = total(age * (age <= 17)), by(family)
    . replace totalage = totalage - age * (age <= 17)
    . generate meanage = totalage/nchild
Here the multiplier age <= 17 says the summand is 0 whenever age is 18 or more, so the total is the correct total and is assigned to all observations in each family.
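The version that includes adults can be tried end to end on the example data. This is a sketch that simply combines the count of other children with the total of their ages; the assert lines are added for illustration.

```stata
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
3 1 0 39
3 2 1 36
3 3 0 11
3 4 1 9
3 5 1 7
3 6 1 3
end

* number of other children, as before
egen nchild = total(age <= 17), by(family)
replace nchild = nchild - (age <= 17)

* total age of the other children; adults contribute 0 to the sum
egen totalage = total(age * (age <= 17)), by(family)
replace totalage = totalage - age * (age <= 17)
generate meanage = totalage/nchild

* adults in family 3 see children aged 11, 9, 7, and 3: mean 30/4 = 7.5
assert meanage == 7.5 if family == 3 & age > 17
* the 16-year-old in family 1 has one other child, aged 14
assert meanage == 14 if family == 1 & age == 16

list family person age nchild totalage meanage, sepby(family)
```

An only child would have nchild equal to 0. Stata returns missing for division by zero, so meanage would be missing for that child, which seems the sensible result.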
What we have done so far hinges rather delicately on two properties of sums: first, the sum for “everybody else” is just the sum for “everybody” minus the sum (the value) for this observation; and second, that the value of a sum is not affected by adding or subtracting 0. When we turn to other summary statistics, we can no longer rely on these properties. We need a more general approach.
In broad terms, we need to do the work within a loop:
    for each member in the family {
        calculate a statistic from data on the family
        assign the result to that member of that family
    }
Let us suppose that we want to know, for each child, the maximum age of the other children in the same family. Within the loop, we will find ourselves assigning chunks of values: for that task, we cannot use generate repeatedly. We can use replace repeatedly, so we need to generate a variable before we can do that:
. generate maxage = .
Next we need an identifier running from 1 and above to assign to each person in the family. In our little dataset, there was already such an identifier, but, if there was not, one could easily be created using by with the sort option:
    . by family, sort: gen pid = _n
    . summarize pid
Under by varlist: _n is interpreted within each group of observations, not for the whole dataset. For this problem, it does not matter that pid is arbitrary; we just need a systematic way of doing the calculations in turn for each member of the family. The summarize shows us the maximum value of pid, which we will need shortly. We could also pick up the value of the maximum as r(max), which is important for any automation of the whole process.
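One way to pick up the maximum for later automation is to copy r(max) into a local macro straight after summarize, because a later r-class command would overwrite the saved results. The macro name maxpid is an assumption made for this sketch, and only the family identifier from the example data is entered.

```stata
clear
* family identifier only, with the same family sizes as the example data
input family
1
1
1
2
2
2
2
2
3
3
3
3
3
3
end

by family, sort: gen pid = _n

* meanonly suppresses the output but still saves r(max)
summarize pid, meanonly
local maxpid = r(max)

* the largest family has 6 members
assert `maxpid' == 6
display "largest family has `maxpid' members"
```

If the block is run as a whole from a do-file, `maxpid' is then available as the upper limit of a forvalues loop.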
Within the loop, we need a way of excluding each value of pid from the calculation. Here is one way to do it, using forvalues:
    . quietly forvalues i = 1/`r(max)' {
    .     generate include = 1 if pid != `i' & age <= 17
    .     egen work = max(age * include), by(family)
    .     replace maxage = work if pid == `i'
    .     drop include work
    . }

The forvalues construct loops over values of the local macro i, which is set in turn to 1, then to 2, and so on, up to the maximum of pid as returned by summarize. The macro is automatically incremented each time through the loop. In practice, most Stata programmers use the abbreviation forval. Within the loop, the value of i is referred to as `i'. The generate statement produces a variable that is 1 if the observation is to be included in the calculation and missing otherwise. The expression age * include, which is then fed to egen, max(), is age * 1 or age when include is 1, and age * . or missing . when include is missing. What egen, max() does is exclude missings from the calculation, and, only if all the values in each group are missing, will the maximum be returned as missing. Although Stata has a general rule that numeric missing is larger than any other numeric value, it assumes when calculating maxima that you really want the largest nonmissing value. (See what happens when you type display max(1,2,_pi,42,.).) We then use the result of that calculation to replace the maxage value for the current member of the family. Finally, it is easiest to drop the variables include and work so that Stata can start afresh next time around the loop.
Why is this loop not the following code?
    . quietly forvalues i = 1/`r(max)' {
    .     egen work = max(age) if age <= 17 & pid != `i', by(family)
    .     replace maxage = work if pid == `i'
    .     drop work
    . }

The reason this will not work as desired is that the result of the egen calculation will be missing for observations excluded by the if condition, and the observations with pid equal to `i' are exactly the ones excluded. In fact, the result of the loop is that all values of maxage will be missing.
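For each child, there is an older one (strictly, one or more) if maxage is greater than age. Because maxage is missing for a child with no other children in the family, and missing counts as larger than any number, we also require maxage to be nonmissing,

    . generate olderch = (maxage > age & maxage < .) if age <= 17
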
and we could use a similar approach to get the minimum age of the other children and thus to determine whether there are younger children.
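A sketch of that similar approach for the minimum follows, assuming the existing person variable serves as the within-family identifier (it runs from 1 upward in each family in the example data). The names minage, youngerch, and maxperson are chosen for this illustration.

```stata
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
3 1 0 39
3 2 1 36
3 3 0 11
3 4 1 9
3 5 1 7
3 6 1 3
end

summarize person, meanonly
local maxperson = r(max)

generate minage = .
quietly forvalues i = 1/`maxperson' {
    * include is 1 for every child except member `i', missing otherwise
    generate include = 1 if person != `i' & age <= 17
    * egen, min() ignores the missing products
    egen work = min(age * include), by(family)
    replace minage = work if person == `i'
    drop include work
}

* a missing minage is never less than age, so no extra guard is needed here
generate youngerch = minage < age if age <= 17

assert minage == 7 if family == 3 & age == 3
assert youngerch == 0 if family == 3 & age == 3
assert youngerch == 1 if family == 3 & age == 11

list family person age minage youngerch, sepby(family)
```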
The same general scheme can be used for other egen functions that take an expression exp as an argument and allow by() as an option; see egen.
Consider a family survey in which we do not have direct information about the number of children of each person. We do have variables for family ID family and individual ID person and also for father ID fatherm and mother ID motherm (which are missing if a person’s mother or father is not a member of the same family). Thus in the example,
    family   person   fatherm   motherm
         1        1         .         .
         1        2         .         .
         1        3         1         2
         1        4         1         2
         1        5         1         2
         2        1         .         .
         2        2         .         1
         2        3         .         2
family 1 includes a couple and three children, all of whom are children of the same mother and father, whereas family 2 includes a grandmother, her daughter, and a grandchild—the son or daughter of that daughter.
The problem is to create a variable ownchild giving the number of each person’s own children living in the family. Thus in family 1, both parents have three children living with them, whereas in family 2, both the grandmother and her daughter have one child each living with them.
We first find the number of children of each father and each mother:
    . by family fatherm, sort: gen fchild = _N if fatherm < .
    . by family motherm, sort: gen mchild = _N if motherm < .
Under by varlist: _N is interpreted within each group of observations, not for the whole dataset. Now we initialize the variable to be produced and a variable we will need to produce it. Both can be byte variables:
    . gen byte ownchild = 0
    . gen byte ischild = 0

We are going to loop over the values of person within each family. We can see in the example that these range from 1 to 5, but, more generally, we can pick up the maximum from summarize, as in the previous problem:
. summarize person, meanonly
The main loop is like this, which we will look at first and then unpack:
    . forval i = 1 / `r(max)' {
    .     replace ischild = (fatherm == `i') | (motherm == `i')
    .     #delimit ;
    .     qui by family (ischild), sort:
    .     replace ownchild =
    .     cond(motherm[_N] == `i', mchild[_N], fchild[_N])
    .     if person == `i' & ischild[_N] ;
    .     #delimit cr
    . }
As we go around the forvalues loop, the local macro i is varied from 1 to the maximum observed person, which we pick up as r(max). Here we are capitalizing on the fact that person takes small integers from 1 and above within each family. Later, we will look at a method for mapping arbitrary identifiers to this set-up. What may look like a special case is a step away from any identifier scheme.
Follow through as we start the loop with `i' and also person equal to 1. Members of each family are children of this person if he or she is their father or their mother. forval substitutes 1 for `i':
. replace ischild = (fatherm == 1) | (motherm == 1)
This indicator variable will be 0 (is not a child of 1) or 1 (is a child of 1). For more explanation of indicator variables as showing true or false, see http://www.stata.com/support/faqs/data-management/true-and-false/.
Within each family, we are going to sort on this variable, so that all the children of person 1 come at the end of each family. Then we can pick up the number of children from the other variables in the last observation, subject to conditions to be mentioned in a moment.
qui by family (ischild), sort: replace ownchild = cond(motherm[_N] == `i', mchild[_N], fchild[_N]) if person == `i' & ischild[_N]
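This is a lot of information in one statement and is best taken in pieces. The prefix by family (ischild), sort: works within each family and sorts on ischild, so that any children of the current person end up last, in observation _N. The if condition person == `i' & ischild[_N] restricts the replace to the current person, and only when the last observation in the family really is one of that person's children. The cond() function then picks the count to use: if the mother of that last observation is the current person, it takes mchild[_N]; otherwise, the current person must be the father, and it takes fchild[_N].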
We went through the operations for person equal to 1. forvalues automatically repeats them for the other values of person.
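The whole calculation can be checked on the example families. The sketch below is meant to be run from a do-file: it uses /// line continuation instead of #delimit, and it stores r(max) in a local macro called maxperson; both are choices made for this illustration.

```stata
clear
input family person fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

* number of children of each father and of each mother
by family fatherm, sort: gen fchild = _N if fatherm < .
by family motherm, sort: gen mchild = _N if motherm < .

gen byte ownchild = 0
gen byte ischild = 0

summarize person, meanonly
local maxperson = r(max)

quietly forvalues i = 1/`maxperson' {
    * flag the children of person `i' in each family
    replace ischild = (fatherm == `i') | (motherm == `i')
    * children sort last; copy their count to person `i'
    by family (ischild), sort: replace ownchild =                ///
        cond(motherm[_N] == `i', mchild[_N], fchild[_N])         ///
        if person == `i' & ischild[_N]
}

* both parents in family 1 have 3 children; their children have none
assert ownchild == 3 if family == 1 & person <= 2
assert ownchild == 0 if family == 1 & person >= 3
* grandmother and daughter in family 2 each have 1 child living with them
assert ownchild == 1 if family == 2 & person <= 2

sort family person
list family person fatherm motherm ownchild, sepby(family)
```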
We have seen that for some problems there is an advantage in using integer identifiers which run from 1 and above within each group. If such identifiers do not exist, they can be created, as seen in section 5.
What needs more explanation is how to map arbitrary existing identifiers to this setup. Suppose that the identifiers were, say,
    family   person   fatherm   motherm
         1     1001         .         .
         1     1002         .         .
         1     1003      1001      1002
         1     1004      1001      1002
         1     1005      1001      1002
         2     2001         .         .
         2     2002         .      2001
         2     2003         .      2002

First, we generate integers from 1 and above as before:
. by family (person), sort: gen pid = _n
We need to map fatherm and motherm to consistent identifiers. We initialize the variables we want:

    . gen byte fid = .
    . gen byte mid = .
Now our main loop is to cycle through the values of pid, which by construction contains integers 1 and above. We replace fid and mid by each value as appropriate:
    . summarize pid, meanonly
    . qui forval i = 1 / `r(max)' {
    .     #delimit ;
    .     by family: replace fid = `i'
    .     if fatherm == person[`i'] & !missing(fatherm) ;
    .     by family: replace mid = `i'
    .     if motherm == person[`i'] & !missing(motherm) ;
    .     #delimit cr
    . }
That is, by cycling through all the values of pid, we are also cycling through all the values of person. Although the example dataset contains numeric identifiers for person, fatherm, and motherm, the code is general enough to apply to string identifiers as well.
Doing this by family: covers the case in which a value of person is unique for a person within a family but may also be an identifier for another person in another family. That is, one person may be person 1 in one family and another person may also be person 1, but in another family. Alternatively, if person has a unique value for each person in the dataset, we lose nothing by doing this under by:, except that possibly it may be a little slower in machine time.
The extra conditions & !missing(fatherm) and & !missing(motherm) are needed. Why? In the example, family 1 has 5 members and family 2 has 3 members. When the forval loop gets to 4, we are using the conditions if fatherm == person[4] and if motherm == person[4]. Under by family: subscripting is interpreted within groups defined by family, but there is no 4th observation for family 2. Stata evaluates person[4] as missing in this circumstance, but we then have a problem in that any values of fatherm or motherm that are missing will get mapped to 4. To prevent this mapping, we add the extra condition that the variable in question must not be missing.
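The mapping can be verified on the example with arbitrary identifiers. This sketch writes each replace on a single line rather than using #delimit, and stores r(max) in a local macro named maxpid; both are choices made for this illustration.

```stata
clear
input family person fatherm motherm
1 1001 . .
1 1002 . .
1 1003 1001 1002
1 1004 1001 1002
1 1005 1001 1002
2 2001 . .
2 2002 . 2001
2 2003 . 2002
end

* integers 1, 2, ... within each family, in order of person
by family (person), sort: gen pid = _n

gen byte fid = .
gen byte mid = .

summarize pid, meanonly
local maxpid = r(max)

quietly forvalues i = 1/`maxpid' {
    * person[`i'] is the ith person within each family under by family:
    by family: replace fid = `i' if fatherm == person[`i'] & !missing(fatherm)
    by family: replace mid = `i' if motherm == person[`i'] & !missing(motherm)
}

* family 1: father 1001 maps to 1 and mother 1002 maps to 2
assert fid == 1 & mid == 2 if family == 1 & person >= 1003
* family 2 records no fathers, so fid stays missing despite person[4] being missing
assert missing(fid) if family == 2
assert mid == 1 if person == 2002
assert mid == 2 if person == 2003

list family person pid fatherm motherm fid mid, sepby(family)
```

Dropping the !missing() conditions and rerunning should make the assertion for family 2 fail, which is a quick way to see why those conditions are needed.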
Thanks to Guillermo Cruces for posing the problem in sections 6 and 7.