Stata | FAQ: Listing observations in a group that differ on a variable

The following material is based on a question and answer that appeared on Statalist.

How do I list observations in a group that differ on a variable?

Title		Listing observations in a group that differ on a variable
Author		Nicholas J. Cox, Durham University, UK

The problem

I have data on various individuals with genotypes ascertained from samples taken at different times. I want to list only those samples with differing genotypes for each individual.

The data are

 eid     egenotype  
 0       vv         
 0       vv         
 1       vv         
 1       ww         
 2       ww         
 2       vv         
 2       ww
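
To follow along, the listing above can be typed in directly. This is a minimal sketch that assumes egenotype is held as a string variable (str2); the question does not say how it is stored, so that storage type is an assumption made only for illustration.

```stata
* Enter the example data; egenotype is assumed here to be a str2 variable
clear
input eid str2 egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
end

* Show the data in their original order, with a separator between individuals
list eid egenotype, sepby(eid)
```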

The solution

The question does not specify whether egenotype is a string variable or a numeric variable with labels. The solution here applies to both and also to numeric variables without labels. First, we sort the data on eid and then on egenotype:

        . sort eid egenotype

If all the values of egenotype are the same for each eid, then, after sorting, the first value within each eid will equal the last. If there is any variation within eid, this will not be true. This works irrespective of the number of observations for each eid, the number of distinct values of egenotype, and the type of variable used. Thus, for eid 0, the first value vv will equal the last, but, for eid 1 and 2, the first and last values will differ. The example of eid 2 also shows why sorting is essential: in the data as listed, its first and third (last) values are both ww, but the middle value is vv.

Accordingly, we work out which groups have different values and then list those groups only:

        . by eid (egenotype), sort: gen diff = egenotype[1] != egenotype[_N] 
        . list eid egenotype if diff

The by ..., sort prefix combines sort eid egenotype with an ensuing by eid: generate statement. Under the protection of by:, subscripts apply to observations within each group. Thus [1] denotes the first observation, and [_N] denotes the last observation within each group. If the corresponding values differ, diff will be 1, and, if they are the same, diff will be 0. (For more information on this, see FAQ: What is true and false in Stata?) Then the list is restricted to observations in groups whose values differ.
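
To see what the subscripts pick up within each group, the first and last values can be stored in variables of their own. The sketch below is written as do-file lines (no dot prompt) and assumes a string egenotype; the variable names first and last are introduced here purely for illustration.

```stata
* Example data from the question (egenotype assumed to be str2)
clear
input eid str2 egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
end

* Sort on eid, then egenotype, and flag groups whose extremes differ
by eid (egenotype), sort: gen byte diff = egenotype[1] != egenotype[_N]

* Illustrative helpers: the values that [1] and [_N] refer to in each group
by eid (egenotype): gen str2 first = egenotype[1]
by eid (egenotype): gen str2 last  = egenotype[_N]

* eid 0 has first == last; eid 1 and eid 2 do not
list eid egenotype first last diff, sepby(eid)
```

With these data, diff is 0 for both observations of eid 0 and 1 for every observation of eid 1 and eid 2.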

How would this be extended to identifying groups that differ on at least one of two or more variables? One way would be to use egen. For example, the egen function group() could be used to create a single variable that identifies each distinct combination of values of two or more variables, and then the same method could be applied to the resulting variable.
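
A minimal sketch of that extension follows. The second variable, ephenotype, and its values are hypothetical, invented only to have something to combine with egenotype; combo and diff2 are likewise illustrative names.

```stata
* Hypothetical data: ephenotype is not part of the original question
clear
input eid str2 egenotype str1 ephenotype
0 vv a
0 vv b
1 vv a
1 vv a
2 ww a
2 vv a
end

* One integer code per distinct (egenotype, ephenotype) combination
egen long combo = group(egenotype ephenotype)

* Same first-versus-last comparison, now applied to the combined code
by eid (combo), sort: gen byte diff2 = combo[1] != combo[_N]

list eid egenotype ephenotype if diff2
```

Here eid 0 differs on ephenotype only and eid 2 on egenotype only, so both are listed, while eid 1 is not. By default, group() assigns missing to observations with a missing value on any of the variables; its missing option treats missing values as a category of their own, which may or may not be what is wanted.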

The opposite problem: observations with the same values

It should be clear that the opposite problem, finding observations with the same values, has an essentially similar solution. We could negate the variable diff above, which would exchange 0s and 1s. Or, starting from scratch, we could just change the operator from != to ==.

        . by eid (egenotype), sort: gen same = egenotype[1] == egenotype[_N]
        . list eid egenotype if same
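
As a quick check that the two routes agree, the sketch below builds the indicator both ways on the example data, assuming a string egenotype. The name same2 is introduced here only for the comparison.

```stata
* Example data from the question (egenotype assumed to be str2)
clear
input eid str2 egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
end

* Route 1: build diff, then negate it (! exchanges 0s and 1s)
by eid (egenotype), sort: gen byte diff = egenotype[1] != egenotype[_N]
gen byte same = !diff

* Route 2: start from scratch with == instead of !=
by eid (egenotype): gen byte same2 = egenotype[1] == egenotype[_N]

* The two indicators should be identical in every observation
assert same == same2

* Only the two observations for eid 0 are listed
list eid egenotype if same
```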

Careful sorting remains essential here. If all the values in a group are identical, then the first and last values will necessarily be the same, but the converse does not always follow. The first and last of a group with two or more distinct values could be identical as a matter of accident in an unsorted group. So we need sorting within a group to shake different values apart.
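
The pitfall can be reproduced with the example data. In this sketch, which assumes a string egenotype, the variable naive compares first and last values while the original order within eid is kept, and sorted makes the same comparison after sorting on egenotype within eid. Both names are illustrative.

```stata
* Example data from the question (egenotype assumed to be str2)
clear
input eid str2 egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
end

* Stable sort on eid alone keeps the original order within each eid
sort eid, stable
by eid: gen byte naive = egenotype[1] == egenotype[_N]

* Sorting on egenotype within eid shakes different values apart
by eid (egenotype), sort: gen byte sorted = egenotype[1] == egenotype[_N]

* eid 2 is wrongly flagged as "same" by naive but not by sorted
list eid egenotype naive sorted, sepby(eid)
```

For eid 2, the unsorted order is ww, vv, ww, so the first and last values coincide by accident and naive is 1, whereas sorted is 0.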
