Title | Match merging when there are duplicate IDs | |
Author | William Gould, StataCorp |
Your problem is most likely caused by having duplicate IDs. Duplicate IDs can cause unexpected results when doing a match merge. Consider the following examples:
The master dataset has 5 observations, and the using dataset has 8 observations. When you do the merge, every observation has a _merge code of 2 or 3 (every observation in the master dataset was matched), yet the merged dataset contains 9 observations.
Cause: Duplicate observations in the smaller dataset (and perhaps in the larger one, too).
. use junk2 [this is the "using" dataset] . list +--------+ | id y | |--------| 1. | 1 1 | 2. | 1 2 | 3. | 1 3 | 4. | 2 1 | 5. | 3 1 | |--------| 6. | 3 2 | 7. | 4 1 | 8. | 5 1 | +--------+ . use junk1, clear [this is the master dataset] . sort id x . list +--------+ | id x | |--------| 1. | 1 1 | 2. | 1 2 | 3. | 2 1 | 4. | 2 2 | 5. | 3 1 | +--------+ . merge id using junk2 variable id does not uniquely identify observations in the master data variable id does not uniquely identify observations in junk2.dta . list +---------------------+ | id x y _merge | |---------------------| 1. | 1 1 1 3 | 2. | 1 2 2 3 | 3. | 2 1 1 3 | 4. | 2 2 1 3 | 5. | 3 1 1 3 | |---------------------| 6. | 1 2 3 3 | 7. | 3 1 2 3 | 8. | 4 . 1 2 | 9. | 5 . 1 2 | +---------------------+
Examine x and y, and you will see how merge matches up duplicates.
There are only 3 observations in your master dataset, yet, when you do the merge, there are 4 observations that have a _merge code of 3 (meaning the observations are in both datasets).
Cause: There are duplicates in the using dataset.
. use junk2, clear [this is the "using" dataset] . list +--------+ | id y | |--------| 1. | 1 1 | 2. | 1 2 | 3. | 1 3 | 4. | 2 1 | 5. | 3 1 | |--------| 6. | 3 2 | 7. | 4 1 | 8. | 5 1 | +--------+ . use junk1, clear [this is the master dataset] . sort id x . list +--------+ | id x | |--------| 1. | 1 1 | 2. | 2 2 | 3. | 3 3 | +--------+ . merge id using junk2 variable id does not uniquely identify observations in junk2.dta . list +---------------------+ | id x y _merge | |---------------------| 1. | 1 1 1 3 | 2. | 2 2 1 3 | 3. | 3 3 1 3 | 4. | 1 1 2 3 | 5. | 1 1 3 3 | |---------------------| 6. | 3 3 2 3 | 7. | 4 . 1 2 | 8. | 5 . 1 2 | +---------------------+
The duplicates and isid commands help identify and deal with duplicate observations.