Stata | FAQ: Selecting a subset of observations with a complicated criterion

Home / Resources & support / FAQs / Selecting a subset of observations with a complicated criterion

The following material is based on postings to Statalist.

How do I select a subset of observations using a complicated criterion?

Title		Selecting a subset of observations with a complicated criterion
Author		Nicholas J. Cox, Durham University, UK

Question

I have a dataset, and I wish to work with a subset of observations, and that subset is defined by a complicated criterion. (This might be a long list of identifiers or some other codes specifying which observations belong in the subset.) What is the easiest way to do this?

Answer

Before starting to answer, let us indicate just two situations in which this question might arise. You might wish to work with a smaller dataset that is defined by either this criterion or its complement. Or, you might wish to define an indicator or dummy variable (say, indicating firms subject to some particular regulation, which fall only into certain industries). Our example code focuses on the first, but, essentially, the same basic idea applies to the second.

1 The obvious but tedious way

You already know one solution: using a complicated if condition. It is just that you really would rather not type out some long line like

        . keep if id == 12 | id == 23 | id == 34 | id == 45 | and so on, and so on

In practice, what you type should never be as long as this example implies. You can cut down typing substantially by using functions such as inlist() and inrange().

(We will mention just once that it may be easier to use drop on the complementary subset, rather than using keep on the subset you want, a point that applies throughout this FAQ.)

2 A less general, but sometimes useful, way

Repeated typing of various syntax elements is part of what makes this approach difficult. Questions like this arise frequently, so we need other methods. There is another way to approach selection whenever equality with any of several integer values is the criterion.

        . egen OK = anymatch(id), values(12 23 34 45 and so on)
        . keep if OK

The first statement uses the egen command. Read this as generate the new variable OK that is 1 (true) if id is equal to any of the values specified and 0 otherwise. anymatch() in Stata 9 and later releases is a replacement for eqany() in Stata 8 and prior releases. Crucially, the argument of values() may be a numlist, so, to give only one example, unbroken sequences of integers may be specified concisely. This function is similar to using inlist() or inrange() with if, as mentioned above.

3 Another way, using merge

A third way uses merge. You may need to get around a mental block that merge is a command that produces larger datasets; i.e., "this dataset" plus "that dataset". It does precisely that, but, in the problems discussed here, the useful product is the intersection, not the union.

To make matters concrete, let us suppose that main.dta contains observations and an identifier variable id, and we wish to select observations for some of those identifiers.

save main.dta, sorted on id, if you have not already done so.
clear those data, and create a dataset in Stata containing only the identifiers you want, using the same variable name id, with the same variable type as in main.dta, and sorted on id.
Now type
```
        . merge 1:m id using main
```
Observations with values for _merge of 3 are those which you want; that is, they form the overlap or intersection of the two datasets. Hence,
```
        . keep if _merge == 3
        . drop _merge
```

This method is free of any limits imposed by restrictions on how long a command line (section 1) or an option argument (section 2) may be.

4 A shortcut with the above when identifiers are numeric

Suppose you have numeric identifiers given by ranges like 1/2 34/56 678/901 or, more generally specifiable, as a numlist. Clearly, you would not want to type in a dataset containing all the individual identifiers. Here is an alternative:

        . clear                
        . numlist "1/2 34/56 678/901"
        . tokenize `r(numlist)' 
        . local N : word count `r(numlist)' 
        . set obs `N' 
        . gen id = . 
        . forval i = 1 / `N' { 
        .	qui replace id = ``i'' in `i'
        . }

In other words, the numlist command expands the abbreviated numlist into its individual elements 1 2 34 35 ... 900 901, tokenize puts those individual elements into local macros, and the other commands then put them into values of the variable id. This shortcut should work for up to 2,500 elements. See help limits.

5 A related FAQ

See also the FAQ How do you efficiently define group characteristics in your data in order to create subsets?, which was written by Kit Baum. Kit provided helpful comments for this FAQ as well.

We use cookies

We use cookies to ensure that we give you the best experience on our website—to enhance site navigation, to analyze usage, and to assist in our marketing efforts. By continuing to use our site, you consent to the storing of cookies on your device and agree to delivery of content, including web fonts and JavaScript, from third party web services.

Cookie Settings

Last updated: 16 November 2022

StataCorp LLC (StataCorp) strives to provide our users with exceptional products and services. To do so, we must collect personal information from you. This information is necessary to conduct business with our existing and potential customers. We collect and use this information only where we may legally do so. This policy explains what personal information we collect, how we use it, and what rights you have to that information.

Advertising and performance cookies

This website uses cookies to provide you with a better user experience. A cookie is a small piece of data our website stores on a site visitor's hard drive and accesses each time you visit so we can improve your access to our site, better understand how you use our site, and serve you content that may be of interest to you. For instance, we store a cookie when you log in to our shopping cart so that we can maintain your shopping cart should you not complete checkout. These cookies do not directly store your personal information, but they do support the ability to uniquely identify your internet browser and device.

Please note: Clearing your browser cookies at any time will undo preferences saved here. The option selected here will apply only to the device you are currently using.

How do I select a subset of observations using a complicated criterion?

Question

Answer

1 The obvious but tedious way

2 A less general, but sometimes useful, way

3 Another way, using merge

4 A shortcut with the above when identifiers are numeric

5 A related FAQ

We use cookies

Privacy policy

Required cookies

Advertising and performance cookies

Stata/MP4 Annual License (download)

How do I select a subset of observations using a complicated criterion?

Question

Answer

1 The obvious but tedious way

2 A less general, but sometimes useful, way

3 Another way, using merge

4 A shortcut with the above when identifiers are numeric

5 A related FAQ

We use cookies

Privacy policy

Required cookies

Advertising and performance cookies