Title | Selecting a subset of observations with a complicated criterion | |
Author | Nicholas J. Cox, Durham University, UK |
I have a dataset, and I wish to work with a subset of observations, and that subset is defined by a complicated criterion. (This might be a long list of identifiers or some other codes specifying which observations belong in the subset.) What is the easiest way to do this?
Before starting to answer, let us indicate just two situations in which this question might arise. You might wish to work with a smaller dataset that is defined by either this criterion or its complement. Or, you might wish to define an indicator or dummy variable (say, indicating firms subject to some particular regulation, which fall only into certain industries). Our example code focuses on the first, but, essentially, the same basic idea applies to the second.
You already know one solution: using a complicated if condition. It is just that you really would rather not type out some long line like
. keep if id == 12 | id == 23 | id == 34 | id == 45 | and so on, and so on
In practice, what you type should never be as long as this example implies. You can cut down typing substantially by using functions such as inlist() and inrange().
(We will mention just once that it may be easier to use drop on the complementary subset, rather than using keep on the subset you want, a point that applies throughout this FAQ.)
Repeated typing of various syntax elements is part of what makes this approach difficult. Questions like this arise frequently, so we need other methods. There is another way to approach selection whenever equality with any of several integer values is the criterion.
. egen OK = anymatch(id), values(12 23 34 45 and so on) . keep if OK
The first statement uses the egen command. Read this as generate the new variable OK that is 1 (true) if id is equal to any of the values specified and 0 otherwise. anymatch() in Stata 9 and later releases is a replacement for eqany() in Stata 8 and prior releases. Crucially, the argument of values() may be a numlist, so, to give only one example, unbroken sequences of integers may be specified concisely. This function is similar to using inlist() or inrange() with if, as mentioned above.
A third way uses merge. You may need to get around a mental block that merge is a command that produces larger datasets; i.e., "this dataset" plus "that dataset". It does precisely that, but, in the problems discussed here, the useful product is the intersection, not the union.
To make matters concrete, let us suppose that main.dta contains observations and an identifier variable id, and we wish to select observations for some of those identifiers.
. merge 1:m id using main
. keep if _merge == 3 . drop _merge
This method is free of any limits imposed by restrictions on how long a command line (section 1) or an option argument (section 2) may be.
Suppose you have numeric identifiers given by ranges like 1/2 34/56 678/901 or, more generally specifiable, as a numlist. Clearly, you would not want to type in a dataset containing all the individual identifiers. Here is an alternative:
. clear . numlist "1/2 34/56 678/901" . tokenize `r(numlist)' . local N : word count `r(numlist)' . set obs `N' . gen id = . . forval i = 1 / `N' { . qui replace id = ``i'' in `i' . }
In other words, the numlist command expands the abbreviated numlist into its individual elements 1 2 34 35 ... 900 901, tokenize puts those individual elements into local macros, and the other commands then put them into values of the variable id. This shortcut should work for up to 2,500 elements. See help limits.
See also the FAQ How do you efficiently define group characteristics in your data in order to create subsets?, which was written by Kit Baum. Kit provided helpful comments for this FAQ as well.