How do I create variables summarizing for each individual properties of the other members of a group?

Title   Creating variables recording properties of the other members of a group
Author Nicholas J. Cox, Durham University, UK

As of 2016, the community-contributed program rangestat (SSC) offers an alternative to the solutions here. You may wish to install it and study its detailed help.

1. Examples: data on families

Suppose you have data on families. For each person in each family, it may be useful to calculate variables that summarize properties of the other members of the same family. How many other children are there? What is their average, maximum, or minimum age? Is there an older child or a younger child? The more general problem can be described as summarizing properties, for each individual, of the other members of the same group.

Let us look at some invented data. For what follows, it is essential to have a group identifier, so, in this example, we have an identifier for each family. It is not always essential to have an individual identifier, but what follows does depend on each person occurring just once in the dataset. In practice, however, such data usually include individual identifiers.

        family     person        sex        age       
  1.         1          1          1         36
  2.         1          2          1         16
  3.         1          3          1         14

  4.         2          1          0         45
  5.         2          2          1         42
  6.         2          3          0         14
  7.         2          4          1         12
  8.         2          5          0         10 

  9.         3          1          0         39
 10.         3          2          1         36
 11.         3          3          0         11
 12.         3          4          1          9
 13.         3          5          1          7
 14.         3          6          1          3

We will suppose that sex is recorded as 1 for female and 0 for male. Such 0–1 coding is in a sense arbitrary, but it makes life easier, especially for statistical modeling in which the response is a binary variable and (more directly important here) for counting values within each group.
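To follow along, the listing above can be typed in with input. This is only a sketch for reproducing the invented data. It assumes a session in which nothing else needs to be kept, as clear drops whatever is in memory. The isid check is an addition here, reflecting the requirement that each person occur just once.

```stata
* Enter the invented family data shown above (clear drops data in memory)
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
3 1 0 39
3 2 1 36
3 3 0 11
3 4 1  9
3 5 1  7
3 6 1  3
end

* Check that each person occurs just once within each family
isid family person

list, sepby(family)
```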

2. Specific problem: for each child, how many other children are there?

Let us define children as those whose age is 17 and under. For each child, how many other children are there? This is simply the number of children in the family, minus 1 if the person in question is a child. (In family 3, with 4 children, for each child there are 3 other children.)

For any calculation like this, it is always worth looking to see whether egen provides an answer to at least part of the problem. Many functions have been written for egen. In particular, egen, total() by() is natural for producing totals, including counts, separately for groups defined by one or more variables specified as arguments to by(). egen, count() by() is also often useful but is a little less general in application, so we will concentrate here on total(). total() in Stata 9 and later releases is a replacement for sum() in Stata 8.

 . egen nchild = total(age <= 17), by(family) 
 . replace nchild = nchild - (age <= 17) 

age <= 17 will be true (evaluates to 1) whenever age is less than or equal to 17, and false (evaluates to 0) otherwise. Adding up the 1s and 0s within egen, total() is the same as counting the observations for which age <= 17. We then subtract age <= 17 from each observation, so that children are not counted among their own "other children". The effect of the by(family) option is to count within families, each family being a group of observations with the same value of family. The effect of the replace correction is confined to individual observations.

The syntax for egen indicates that total() works on an expression exp. The argument need not be a single variable but can usefully be something more complicated. Being interested only in other female children is not any more difficult:

 . egen nsisters = total(age <= 17 & sex == 1), by(family)
 . replace nsisters = nsisters - (age <= 17 & sex == 1) 

This solution also assigns values to adults, those with age greater than or equal to 18. This could be useful, or not useful, depending on your substantive problem. If you wanted to exclude adults completely from the calculation, you could specify if age <= 17 on the egen command, and values for adults would then be missing (.).
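The difference between the two versions can be seen side by side in a small sketch. It uses only families 1 and 2 from the example data, and the variable name nchild2 is made up here for illustration.

```stata
* Families 1 and 2 from the example data
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
end

* Version 1: every person, adult or child, gets a count of other children
egen nchild = total(age <= 17), by(family)
replace nchild = nchild - (age <= 17)

* Version 2 (nchild2 is a hypothetical name): adults are excluded by if,
* so they receive missing; the later subtraction leaves missing as missing
egen nchild2 = total(age <= 17) if age <= 17, by(family)
replace nchild2 = nchild2 - 1

list family person age nchild nchild2, sepby(family)
```

In family 2, the two adults have nchild equal to 3 and nchild2 missing, while each of the three children has 2 for both variables.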

If we wanted to count not “other children” but “other adults”, we should be a little more careful. The expression age >= 18 is also true for missing values of age, as in Stata missing counts as higher than any other numeric value. Often we will want to exclude missing ages by using the condition age >= 18 & age < . unless we know we can treat those with missing ages as adults.
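A small sketch of counting other adults while guarding against missing ages follows. The data are a cut-down version of the example in which one age has been set to missing purely for illustration, and the variable name nadult is made up here.

```stata
* Illustrative data: the age of person 3 in family 1 is set to missing
clear
input family person age
1 1 36
1 2 16
1 3  .
2 1 45
2 2 42
2 3 14
end

* age < . excludes missing ages, which would otherwise satisfy age >= 18
egen nadult = total(age >= 18 & age < .), by(family)
replace nadult = nadult - (age >= 18 & age < .)

list, sepby(family)
```

With the guard in place, the person with missing age is not counted as an adult: the only adult in family 1 has nadult equal to 0, and the other two members have 1.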

3. Generic problem: totals and means

Other totals, and by extension means, can be calculated using the same general approach. Put simply,

  1. Calculate the total for each group.
  2. Subtract each member’s contribution from that total (possibly, the contribution is 0).
  3. If needed, calculate the mean as the total divided by the number of values.

What is the average age of the other children in each family? Here is one solution:

 . egen totalage = total(age) if age <= 17, by(family)
 . replace totalage = totalage - age 
 . generate meanage = totalage/nchild

This solution excludes the adults. Not only are they not included in the summation of age, but they also receive missing values for the result. In the replace command, we can be cavalier about excluding or including the adults; either way, the missing values will not be changed.

If we want to include the adults—that is, we want a record for each adult of the average age of the children—here is a solution:

 . egen totalage = total(age * (age <= 17)), by(family) 
 . replace totalage = totalage - age * (age <= 17) 
 . generate meanage = totalage/nchild

Here the multiplier age <= 17 says the summand is 0 whenever age is 18 or more, so the total is the correct total and is assigned to all observations in each family.
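As a check on the arithmetic, the version that includes adults can be run on families 1 and 2 of the example data and compared with values worked out by hand. This is a sketch only; the data subset is chosen here for brevity.

```stata
* Families 1 and 2 from the example data
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
end

* Number of other children, as in section 2
egen nchild = total(age <= 17), by(family)
replace nchild = nchild - (age <= 17)

* Total age of the other children; adults contribute 0 to the sum
egen totalage = total(age * (age <= 17)), by(family)
replace totalage = totalage - age * (age <= 17)
generate meanage = totalage/nchild

list family person age nchild totalage meanage, sepby(family)
```

In family 2, the children are aged 14, 12, and 10. Each adult gets (14 + 12 + 10)/3 = 12, and the 14-year-old gets (12 + 10)/2 = 11.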

4. Generic problem: other statistics

What we have done so far hinges rather delicately on two properties of sums: first, the sum for “everybody else” is just the sum for “everybody” minus the sum (the value) for this observation; and second, that the value of a sum is not affected by adding or subtracting 0. When we turn to other summary statistics, we can no longer rely on these properties. We need a more general approach.

In broad terms, we need to do the work within a loop:

 for each member in the family { 
         calculate a statistic from data on the family 
         assign the result to that member of that family 
 } 

5. Specific problem: maximum age of the other children

Let us suppose that we want to know, for each child, the maximum age of the other children in the same family. Within the loop, we will find ourselves assigning chunks of values: for that task, we cannot use generate repeatedly. We can use replace repeatedly, so we need to generate a variable before we can do that:

 . generate maxage = .  

Next we need an identifier running from 1 and above to assign to each person in the family. In our little dataset, there was already such an identifier, but, if there was not, one could easily be created using by with the sort option:

 . by family, sort: gen pid = _n 
 . summarize pid

Under by varlist: _n is interpreted within each group of observations, not for the whole dataset. For this problem, it does not matter that pid is arbitrary; we just need a systematic way of doing the calculations in turn for each member of the family. The summarize shows us the maximum value of pid, which we will need shortly. We could also pick up the value of the maximum as r(max), which is important for any automation of the whole process.

Within the loop, we need a way of excluding each value of pid from the calculation. Here is one way to do it, using forvalues:

 . quietly forvalues i = 1/`r(max)' { 
 .       generate include = 1 if pid != `i' & age <= 17 
 .       egen work = max(age * include), by(family) 
 .       replace maxage = work if pid == `i' 
 .       drop include work 
 . } 

The forvalues construct loops over values of the local macro i, which is set in turn to 1, then to 2, and so on, up to the maximum of pid as returned by summarize. The macro is automatically incremented each time through the loop. In practice, most Stata programmers use the abbreviation forval. Within the loop, the value of i is referred to as `i'.

The generate statement produces a variable that is 1 if the observation is to be included in the calculation and missing otherwise. The expression age * include, which is then fed to egen, max(), is age * 1 or age when include is 1, and age * . or missing . when include is missing. What egen, max() does is exclude missings from the calculation, and, only if all the values in each group are missing, will the maximum be returned as missing. Although Stata has a general rule that numeric missing is larger than any other numeric value, it assumes when calculating maxima that you really want the largest nonmissing value. (See what happens when you type display max(1,2,_pi,42,.).)

We then use the result of that calculation to replace the maxage value for the current member of the family. Finally, it is easiest to drop the variables include and work so that Stata can start afresh next time around the loop.
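The treatment of missing values by max(), as against comparison operators, can be seen directly with display. The lines below are a small illustration, not part of the solution itself.

```stata
* max() ignores missing arguments unless all of them are missing
display max(1, 2, _pi, 42, .)
display max(., .)

* In a comparison, by contrast, missing counts as larger than any number
display 42 < .
```

The three commands display 42, . (missing), and 1, respectively.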

Why is this loop not the following code?

 . quietly forvalues i = 1/`r(max)' {
 .       egen work = max(age) if age <= 17 & pid != `i', by(family) 
 .       replace maxage = work if pid == `i' 
 .       drop work 
 . } 

The reason this will not work as desired is that the result of the egen calculation will be missing for observations excluded by the if condition, and those include the very observations, with pid equal to `i', that we want to replace. In fact, the result of the loop is that all values of maxage will be missing.

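For each child, there is an older one (strictly, one or more) if maxage is greater than age. Because missing counts as higher than any number, we also check that maxage is not missing; otherwise, a child with no other children in the family would be flagged wrongly,

 . generate olderch = maxage > age & maxage < . if age <= 17 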

and we could use a similar approach to get the minimum age of the other children and thus to determine whether there are younger children.
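A sketch of that similar approach for the minimum follows, using families 1 and 2 of the example data. The names minage and youngerch are made up here, and person is used directly as the identifier because it already runs from 1 upward within each family.

```stata
* Families 1 and 2 from the example data
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
end

generate minage = .
summarize person, meanonly

* For each person in turn, take the minimum age over the other children
quietly forvalues i = 1/`r(max)' {
    generate include = 1 if person != `i' & age <= 17
    egen work = min(age * include), by(family)
    replace minage = work if person == `i'
    drop include work
}

* A missing minage makes minage < age false, so a child with no
* other children in the family is not flagged
generate youngerch = minage < age if age <= 17

list family person age minage youngerch, sepby(family)
```

In family 2, the children aged 14 and 12 each have a younger child (minage is 10), whereas the 10-year-old does not (minage is 12).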

The same general scheme can be used for other egen functions that take an expression exp as an argument and allow by() as an option; see egen.

6. Specific problem: how many of a person’s own children are in the family?

Consider a family survey in which we do not have direct information about the number of children of each person. We do have variables for family ID family and individual ID person and also for father ID fatherm and mother ID motherm (which are missing if a person’s mother or father is not a member of the same family). Thus in the example,

 family person fatherm  motherm
   1      1       .        .
   1      2       .        .
   1      3       1        2
   1      4       1        2
   1      5       1        2
  
   2      1       .        .
   2      2       .        1
   2      3       .        2

family 1 includes a couple and three children, all of whom are children of the same mother and father, whereas family 2 includes a grandmother, her daughter, and a grandchild—the son or daughter of that daughter.

The problem is to create a variable ownchild giving the number of each person’s own children living in the family. Thus in family 1, both parents have three children living with them, whereas in family 2, both the grandmother and her daughter have one child each living with them.

We first find the number of children of each father and each mother:

 . by family fatherm, sort: gen fchild = _N if fatherm < .
 . by family motherm, sort: gen mchild = _N if motherm < .

Under by varlist: _N is interpreted within each group of observations, not for the whole dataset. Now we initialize the variable to be produced and a variable we will need to produce it. Both can be byte variables:

 . gen byte ownchild = 0
 . gen byte ischild = 0

We are going to loop over the values of person within each family. We can see in the example that these range from 1 to 5, but, more generally, we can pick up the maximum from summarize, as in the previous problem:

 . summarize person, meanonly

The main loop is like this, which we will look at first and then unpack:

 . forval i = 1 / `r(max)' { 
 .         replace ischild = (fatherm == `i') | (motherm == `i')
 .         #delimit ;   
 .         qui by family (ischild), sort:
 .         replace ownchild =
 .         cond(motherm[_N] == `i', mchild[_N], fchild[_N])
 .         if person == `i' & ischild[_N] ; 
 .         #delimit cr 
 . }

As we go around the forvalues loop, the local macro i is varied from 1 to the maximum observed person, which we pick up as r(max). Here we are capitalizing on the fact that person takes small integers from 1 and above within each family. Later, we will look at a method for mapping arbitrary identifiers to this setup. What may look like a special case is thus only a step away from handling any identifier scheme.

Follow through as we start the loop with `i' and also person equal to 1. Members of each family are children of this person if he or she is their father or their mother. forval substitutes 1 for `i':

 . replace ischild = (fatherm == 1) | (motherm == 1)

This indicator variable will be 0 (is not a child of 1) or 1 (is a child of 1). For more explanation of indicator variables as showing true or false, see http://www.stata.com/support/faqs/data-management/true-and-false/.

Within each family, we are going to sort on this variable, so that all the children of person 1 come at the end of each family. Then we can pick up the number of children from the other variables in the last observation, subject to conditions to be mentioned in a moment.

 qui by family (ischild), sort:
 replace ownchild =
 cond(motherm[_N] == `i', mchild[_N], fchild[_N])
 if person == `i' & ischild[_N]

This is a lot of information in one statement and is best taken in pieces:

  • qui by family (ischild), sort:

    We are going to do a replace separately by families (recall that family is the family identifier). Within each family, we sort first on ischild so that any children of person 1 go to the end of the family. As always, sort puts lowest values first, so all values of 0 come before all values of 1 for indicator variables such as ischild. Also, we do all this quietly, although that is not essential.
  • replace ownchild = ... if person == 1 & ischild[_N]

    We are going to replace ownchild but only for observations with person equal to 1 and only if the last person in the family is a child of this person. As before, under by varlist: _N is interpreted within each group defined by varlist. Hence ischild[_N] is the value for the last person in each family. (ischild[_N] is a shortcut for ischild[_N] == 1 as they always evaluate to the same result. For more, see the FAQ just cited.)
  • What are we going to replace ownchild with? The condition ischild[_N] ensures that we will only replace values when the last observation in a family is for a child of the person in that family for whom person is 1. If that person is the child's mother, we use the value for mchild; if not, we use the value for fchild:

    cond(motherm[_N] == `i', mchild[_N], fchild[_N])

We went through the operations for person equal to 1. forvalues automatically repeats them for the other values of person.
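The steps of this section can be gathered into one sketch, here with the example data typed in so that the result can be checked. It is written for a do-file and uses /// line continuation in place of #delimit, which is a choice made for this illustration only.

```stata
* Example data for section 6
clear
input family person fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

* Number of children of each father and of each mother
by family fatherm, sort: gen fchild = _N if fatherm < .
by family motherm, sort: gen mchild = _N if motherm < .

gen byte ownchild = 0
gen byte ischild = 0

summarize person, meanonly
forval i = 1/`r(max)' {
    * Flag the children of person `i', then sort them to the end of each family
    quietly replace ischild = (fatherm == `i') | (motherm == `i')
    quietly by family (ischild), sort: replace ownchild =     ///
        cond(motherm[_N] == `i', mchild[_N], fchild[_N])      ///
        if person == `i' & ischild[_N]
}

sort family person
list family person fatherm motherm ownchild, sepby(family)
```

As described above, both parents in family 1 end up with ownchild equal to 3, the grandmother and her daughter in family 2 with 1 each, and everyone else with 0.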

7. Mapping from arbitrary identifiers to integers 1 and above

We have seen that for some problems there is an advantage in using integer identifiers which run from 1 and above within each group. If such identifiers do not exist, they can be created, as seen in section 5.

What needs more explanation is how to map arbitrary existing identifiers to this setup. Suppose that the identifiers were, say,

 family      person     fatherm     motherm
   1          1001           .           .
   1          1002           .           .
   1          1003        1001        1002
   1          1004        1001        1002
   1          1005        1001        1002
   2          2001           .           .
   2          2002           .        2001
   2          2003           .        2002

First, we generate integers from 1 and above as before:

 . by family (person), sort: gen pid = _n

We need to map fatherm and motherm to consistent identifiers. We initialize the variables we want:

 . gen byte fid = .
 . gen byte mid = .

Now our main loop is to cycle through the values of pid, which by construction contains integers 1 and above. We replace fid and mid by each value as appropriate:

 . summarize pid, meanonly
 . qui forval i = 1 / `r(max)' {
 .         #delimit ;
 .         by family: replace fid = `i'
 .         if fatherm == person[`i'] & !missing(fatherm) ;
 .         by family: replace mid = `i'
 .         if motherm == person[`i'] & !missing(motherm) ;
 .         #delimit cr
 . }

That is, by cycling through all the values of pid, we are also cycling through all the values of person. Although the example dataset contains numeric identifiers for person, fatherm, and motherm, the code is general enough to apply to string identifiers as well.

Doing this by family: covers the case in which a value of person is unique for a person within a family but may also be an identifier for another person in another family. That is, one person may be person 1 in one family and another person may also be person 1, but in another family. Alternatively, if person has a unique value for each person in the dataset, we lose nothing by doing this under by:, except that possibly it may be a little slower in machine time.

The extra conditions & !missing(fatherm) and & !missing(motherm) are needed. Why? In the example, family 1 has 5 members and family 2 has 3 members. When the forval loop gets to 4, we are using the conditions if fatherm == person[4] and if motherm == person[4]. Under by family: subscripting is interpreted within groups defined by family, but there is no 4th observation for family 2. Stata evaluates person[4] as missing in this circumstance, but we then have a problem in that any values of fatherm or motherm that are missing will get mapped to 4. To prevent this mapping, we add the extra condition that the variable in question must not be missing.
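The mapping can be tried on the example identifiers with the following sketch. It is written for a do-file, with each replace on one line rather than under #delimit, a choice made for this illustration only.

```stata
* Example data for section 7, with arbitrary numeric identifiers
clear
input family person fatherm motherm
1 1001    .    .
1 1002    .    .
1 1003 1001 1002
1 1004 1001 1002
1 1005 1001 1002
2 2001    .    .
2 2002    . 2001
2 2003    . 2002
end

* Integers from 1 and above within each family
by family (person), sort: gen pid = _n

gen byte fid = .
gen byte mid = .

summarize pid, meanonly
quietly forval i = 1/`r(max)' {
    * person[`i'] is missing when a family has fewer than `i' members,
    * hence the guard against missing fatherm and motherm
    by family: replace fid = `i' if fatherm == person[`i'] & !missing(fatherm)
    by family: replace mid = `i' if motherm == person[`i'] & !missing(motherm)
}

list, sepby(family)
```

The three children in family 1 get fid equal to 1 and mid equal to 2. In family 2, fid stays missing throughout, and mid is 1 for person 2002 and 2 for person 2003. The variables pid, fid, and mid can then take the places of person, fatherm, and motherm in the code of section 6.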

8. Acknowledgment

Thanks to Guillermo Cruces for posing the problem in sections 6 and 7.