Title | Creating variables recording whether any or all members of a group possess some characteristic
Author | Nicholas J. Cox, Durham University, UK
In the simplest case, we have a binary variable recording whether, for example, persons are male or female, unemployed or employed, or whatever, and some group variable, like a variable recording a family identifier. For example,
       family   person   female
  1.        1        1        1
  2.        1        2        1
  3.        1        3        1
  4.        2        1        0
  5.        2        2        0
  6.        2        3        0
  7.        3        1        0
  8.        3        2        0
  9.        3        3        0
 10.        3        4        1
 11.        3        5        1
 12.        3        6        1
Suppose that female is recorded as 1 for female and 0 for male. Such 0–1 coding is in a sense arbitrary but makes life easier, especially for statistical modeling in which the response is a binary variable.
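Imagine various families. If everyone in a family is female, all values of female are 1, so the minimum and the maximum within the family are both 1. If everyone is male, all values are 0, so the minimum and the maximum are both 0. If there is at least one female and at least one male, the values are a mixture of 0s and 1s, so the minimum is 0 and the maximum is 1.
From these examples, we can see a correspondence between two ways of thinking about such families: any member of a family is female exactly when the maximum of female within the family is 1, and all members are female exactly when the minimum is 1.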
Thus egen provides a one-line answer here to each part of the question:
. egen anyfem = max(female), by(family)
. egen allfem = min(female), by(family)
anyfem will be 1 if it is true that any member of a family is female and 0 if it is false; allfem will be 1 if it is true that all members of a family are female and 0 if it is false.
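As a check on that logic, the example data listed earlier can be typed in and both commands run. This is only a minimal do-file sketch: the input block reproduces the listing above, and the final list with its sepby() option is added here purely for inspection.

```stata
clear
input family person female
1 1 1
1 2 1
1 3 1
2 1 0
2 2 0
2 3 0
3 1 0
3 2 0
3 3 0
3 4 1
3 5 1
3 6 1
end

* the maximum of a 0-1 variable is 1 if any value in the group is 1
egen anyfem = max(female), by(family)

* the minimum of a 0-1 variable is 1 only if every value in the group is 1
egen allfem = min(female), by(family)

list, sepby(family)
```

Family 1 (all female) should show anyfem and allfem both 1, family 2 (all male) both 0, and family 3 (mixed) anyfem 1 and allfem 0.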
Real examples could be more complicated than this.
First, what if the characteristic of interest is not coded as a 0–1 variable? This approach is only barely more difficult. The syntax of egen, min() and egen, max() is that each feeds on an expression; see [D] egen. We could have typed
. egen anymale = max(female == 0), by(family)
. egen allmale = min(female == 0), by(family)
. egen anyDemo = max(pty == "D"), by(family)
. egen allDemo = min(pty == "D"), by(family)
In other words, we can use any expression that is true or false. That expression, fed to max() or min(), will be evaluated observation by observation with a result of 1 if true or 0 if false. The expression can refer to numeric or string variables or to a combination of the two.
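To illustrate an expression that combines a numeric and a string variable, here is a small sketch. The six observations are made up for illustration, and the names femDemo, anyfemDemo, and allfemDemo are hypothetical; only female and pty come from the commands above.

```stata
clear
input family female str1 pty
1 1 "D"
1 0 "R"
2 0 "D"
2 0 "D"
3 1 "D"
3 1 "D"
end

* the true-or-false expression can be inspected on its own first:
* 1 where the observation is a female Democrat, 0 otherwise
generate byte femDemo = female == 1 & pty == "D"

* the same expression fed directly to max() and min()
egen anyfemDemo = max(female == 1 & pty == "D"), by(family)
egen allfemDemo = min(female == 1 & pty == "D"), by(family)

list, sepby(family)
```

With these data, family 1 has one female Democrat (any is 1, all is 0), family 2 has none (both 0), and family 3 consists only of female Democrats (both 1).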
Second, what if missing values are present? For numeric variables, missing counts as higher than any other numeric value, but egen, max() is smart enough to ignore it. Only if all values in a group are missing will the result variable be missing.
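A small sketch of that behavior, using made-up data in which one family has a single missing value and another has nothing but missing values:

```stata
clear
input family female
1 0
1 .
2 .
2 .
end

* missing values are ignored unless every value in the group is missing
egen anyfem = max(female), by(family)

list, sepby(family)
```

Here anyfem should be 0 for family 1, because the missing value is ignored, and missing for family 2, because there is nothing else to go on.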
Occasionally, you may want a strict definition of all—that literally all values in a group must possess the characteristic, with no missing values allowed. Here is one approach:
. egen anymiss = max(missing(female)), by(family)
. egen allfem = min(female) if !anymiss, by(family)
Here is another:
. egen anymiss = max(missing(female)), by(family)
. egen allfem = min(female), by(family)
. replace allfem = 0 if anymiss
The difference is that, in the first case, any family with a member of unknown sex will be coded as missing, whereas, in the second case, any such family will be coded as 0.
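The two approaches can be compared side by side on a toy dataset. This is a sketch under stated assumptions: the data are made up, and the result variables are given the hypothetical names allfem1 and allfem2 so that both can exist in the same dataset.

```stata
clear
input family female
1 1
1 1
2 1
2 .
3 0
3 1
end

* 1 for every member of a family in which any value of female is missing
egen anymiss = max(missing(female)), by(family)

* first approach: families with any missing value get a missing result
egen allfem1 = min(female) if !anymiss, by(family)

* second approach: families with any missing value are coded 0
egen allfem2 = min(female), by(family)
replace allfem2 = 0 if anymiss

list, sepby(family)
```

Family 2, which has one member of unknown sex, should end up with allfem1 missing and allfem2 equal to 0; the two results agree for the other families.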
Missing values also need care inside expressions. For example, female == 0 is false (0) if female is missing; that is, female == 0 does not evaluate to missing. If we had another variable in our data, say, grade taking on values 1, 2, 3, 4, ..., then grade > 3 would be true even if grade were missing, because numeric missing counts as higher than any other numeric value. Think of missing values as positive infinity. In some instances, excluding missing values explicitly is the most appropriate specification.
. egen anyhigh = max(grade > 3 & grade < .), by(group)
. egen allhigh = min(grade > 3 & grade < .), by(group)
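To see what the extra condition grade < . buys, the guarded commands can be set against unguarded versions. The data below are made up, and anyhigh_naive and allhigh_naive are hypothetical names used only for this comparison.

```stata
clear
input group grade
1 4
1 5
2 4
2 .
3 2
3 .
end

* without the guard, a missing grade counts as greater than 3
egen anyhigh_naive = max(grade > 3), by(group)
egen allhigh_naive = min(grade > 3), by(group)

* with the guard, a missing grade makes the expression false
egen anyhigh = max(grade > 3 & grade < .), by(group)
egen allhigh = min(grade > 3 & grade < .), by(group)

list, sepby(group)
```

In group 2, the missing grade passes the unguarded test, so allhigh_naive is 1 while allhigh is 0. In group 3, the only value above 3 is the missing one, so anyhigh_naive is 1 while anyhigh is 0.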
Thanks to Tom Rogers for highlighting an incorrect detail in an earlier version.