Title | Identifying and dropping duplicate observations
Author | William Gould, StataCorp |
Starting with Stata 8, the duplicates command provides a way to report on, give examples of, list, browse, tag, or drop duplicate observations. This FAQ is likely only of interest to users of previous versions of Stata.
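For users of Stata 8 or later, a minimal sketch of the duplicates approach might look like the following. The key variables name, age, and sex are the same ones used in the cases below, and dupcount is a hypothetical variable name of our choosing:

. * report duplicates judged on name, age, and sex
. duplicates report name age sex
. * record, for each observation, how many other observations duplicate it (0 if unique)
. duplicates tag name age sex, generate(dupcount)
. * drop surplus copies; force is required when a varlist is specified
. duplicates drop name age sex, force

Note that duplicates tag counts copies rather than numbering occurrences, so it differs slightly from the dup variable constructed below. Check [D] duplicates for the details; the rest of this FAQ uses only features available before Stata 8.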
Case 1. You wish to create a new variable named dup

        dup = 0   record is unique
        dup = 1   record is duplicate, first occurrence
        dup = 2   record is duplicate, second occurrence
        dup = 3   record is duplicate, third occurrence
        etc.
and to base the determination on the variables name, age, and sex.
. sort name age sex
. quietly by name age sex: gen dup = cond(_N==1,0,_n)
Note the capitalization of _N and _n. (Stata interprets _N to mean the total number of observations in the by-group and _n to be the observation number within the by-group.)
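If the distinction is not obvious, here is a small sketch that records the two quantities separately; groupsize and seq are hypothetical variable names, not part of the solution above:

. sort name age sex
. * groupsize = number of observations in the by-group (_N)
. by name age sex: gen groupsize = _N
. * seq = observation number within the by-group (_n)
. by name age sex: gen seq = _n

Within each name-age-sex group, groupsize holds the size of the group and seq runs 1, 2, 3, ...; cond(_N==1,0,_n) therefore returns 0 when the record is unique and the within-group occurrence number otherwise.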
Having created the new variable dup, you could then
. tabulate dup
to see a report of the duplicate count.
To base the duplicate count solely on name, type
. sort name
. quietly by name: gen dup = cond(_N==1,0,_n)
To base the duplicate count on name, age, sex, and address, type
. sort name age sex address
. quietly by name age sex address: gen dup = cond(_N==1,0,_n)
Case 2. Picking up where case 1 left off, if you want to drop all duplicate observations but keep the first occurrence, type
. drop if dup>1
To drop all duplicate observations, including the first occurrence, type
. drop if dup>0
Case 3. Suppose instead that you want to base the determination on every variable in your dataset. Obviously, one solution would be to treat this as case 1 and simply type the names of all the variables. The following variation on that idea avoids some typing:
. unab vlist : _all
. sort `vlist'
. quietly by `vlist': gen dup = cond(_N==1,0,_n)
The use of unab is a trick; it puts the names of every variable in the dataset in the macro vlist.
In the last two lines, note that you type `vlist', which is to say, left single quote, the word vlist, right single quote. Putting open and close quotes around a macro's name is equivalent to typing out the contents of the macro. Thus sort `vlist' is equivalent to typing out sort followed by the names of all the variables in the dataset.
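To convince yourself of what unab did, you can display the macro's contents; a quick check:

. unab vlist : _all
. * show the expanded list of variable names stored in the local macro
. display "`vlist'"

The display command prints the full list of variable names, which is exactly what the sort and by commands above will see.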
Case 4. Picking up where case 3 left off, this is no different from case 2. If you want to drop all duplicate observations but keep the first occurrence, type
. drop if dup>1
If you want to drop all duplicate observations, including the first occurrence, type
. drop if dup>0
Example 1. We have the following data:
. list make price mpg

  1.  VW Diesel    5397   41
  2.  BMW 320i     9735   25
  3.  Datsun 510   5079   24
  4.  Audi 5000    9690   17
  5.  BMW 320i     9375   25
  6.  VW Diesel    5397   41
  7.  BMW 320i     9735   25
We wish to eliminate duplicate observations based on make:
. sort make
. quietly by make: gen dup = cond(_N==1,0,_n)
The final step is drop if dup>1, but before we do that, we will list the data so that you can see the values of the new variable dup:
. list make price mpg dup

  1.  Audi 5000    9690   17   0
  2.  BMW 320i     9735   25   1
  3.  BMW 320i     9735   25   2
  4.  BMW 320i     9375   25   3
  5.  Datsun 510   5079   24   0
  6.  VW Diesel    5397   41   1
  7.  VW Diesel    5397   41   2

. drop if dup>1
(3 observations deleted)

. list make price mpg dup

  1.  Audi 5000    9690   17   0
  2.  BMW 320i     9735   25   1
  3.  Datsun 510   5079   24   0
  4.  VW Diesel    5397   41   1
Example 2. We start with the data used in example 1, but this time we drop duplicates based on all the variables:
. list make price mpg

  1.  VW Diesel    5397   41
  2.  BMW 320i     9735   25
  3.  Datsun 510   5079   24
  4.  Audi 5000    9690   17
  5.  BMW 320i     9375   25
  6.  VW Diesel    5397   41
  7.  BMW 320i     9735   25

. unab vlist : _all
. sort `vlist'
. quietly by `vlist': gen dup = cond(_N==1,0,_n)
. drop if dup>1
(2 observations deleted)

. list make price mpg dup

  1.  Audi 5000    9690   17   0
  2.  BMW 320i     9375   25   0
  3.  BMW 320i     9735   25   1
  4.  Datsun 510   5079   24   0
  5.  VW Diesel    5397   41   1
We obtained a different result. This time, the BMW 320i appears twice in our data, once with a price of 9375 and once with a price of 9735. Presumably, one of those prices contains a transcription error. When we based the duplicate identification solely on the contents of the variable make, we kept one of the three BMW observations, chosen effectively at random.
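Which BMW survives depends on how ties in make happen to be ordered after sorting. If that matters to you, one possibility (our own variation, not part of the original example) is to add more variables to the sort so that the kept observation is well defined. Starting again from the original seven observations, before any dup variable exists, the following keeps the lowest-priced record of each make:

. * sorting on price within make makes the first occurrence deterministic
. sort make price
. quietly by make: gen dup = cond(_N==1,0,_n)
. drop if dup>1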
An alternative method for identifying duplicates is discussed in the Technical Note in [D] egen.
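For reference, one egen-based possibility is sketched below; it is our own illustration and not necessarily the method described in that technical note. It uses egen's tag() function, which marks exactly one observation in each group defined by the listed variables, so dropping the unmarked observations keeps a single record per group (keepme is a hypothetical variable name):

. * keepme = 1 for one observation in each name-age-sex group, 0 for the others
. egen byte keepme = tag(name age sex)
. drop if keepme == 0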