Stata
Products Purchase Support Company
Search
   >> Home >> Resources & support >> FAQs >> Match merging when there are duplicate IDs Bookmark and Share

Why does my merge produce a dataset with too many observations?

Title   Match merging when there are duplicate IDs
Author William Gould, StataCorp
Date January 1996; minor update August 2005

Your problem is most likely caused by having duplicate IDs. Duplicate IDs when doing a match merge can cause unexpected results. Consider the following examples:

Example 1. There are too many observations in the merged dataset

The master dataset has 5 observations, and the using dataset has 8 observations. When you do the merge, every observation has a _merge code of 2 or 3 (every observation in the master dataset was matched), yet the merged dataset contains 9 observations.

Cause: Duplicate observations in the smaller dataset (and perhaps in the larger one, too).

. use junk2  [this is the "using" dataset]
        
. list

     +--------+
     | id   y |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  1   3 |
  4. |  2   1 |
  5. |  3   1 |
     |--------|
  6. |  3   2 |
  7. |  4   1 |
  8. |  5   1 |
     +--------+

. use junk1, clear  [this is the master dataset]
        
. sort id x

. list

     +--------+
     | id   x |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  2   1 |
  4. |  2   2 |
  5. |  3   1 |
     +--------+

. merge id using junk2
variable id does not uniquely identify observations in the master data
variable id does not uniquely identify observations in junk2.dta

. list

     +---------------------+
     | id   x   y   _merge |
     |---------------------|
  1. |  1   1   1        3 |
  2. |  1   2   2        3 |
  3. |  2   1   1        3 |
  4. |  2   2   1        3 |
  5. |  3   1   1        3 |
     |---------------------|
  6. |  1   2   3        3 |
  7. |  3   1   2        3 |
  8. |  4   .   1        2 |
  9. |  5   .   1        2 |
     +---------------------+

Examine x and y, and you will see how merge matches up duplicates.

Example 2. More matches than observations in the smallest dataset

There are only 3 observations in your master dataset, yet when you do the merge, there are 4 observations that have a _merge code of 3 (meaning the observations are in both datasets).

Cause: There are duplicates in the using dataset.

. use junk2, clear  [this is the "using" dataset]
        
. list
        
     +--------+
     | id   y |
     |--------|
  1. |  1   1 |
  2. |  1   2 |
  3. |  1   3 |
  4. |  2   1 |
  5. |  3   1 |
     |--------|
  6. |  3   2 |
  7. |  4   1 |
  8. |  5   1 |
     +--------+
          
. use junk1, clear  [this is the master dataset]

. sort id x
        
. list

     +--------+
     | id   x |
     |--------|
  1. |  1   1 |
  2. |  2   2 |
  3. |  3   3 |
     +--------+

. merge id using junk2 
variable id does not uniquely identify observations in junk2.dta

.  list

     +---------------------+
     | id   x   y   _merge |
     |---------------------|
  1. |  1   1   1        3 |
  2. |  2   2   1        3 |
  3. |  3   3   1        3 |
  4. |  1   1   2        3 |
  5. |  1   1   3        3 |
     |---------------------|
  6. |  3   3   2        3 |
  7. |  4   .   1        2 |
  8. |  5   .   1        2 |
     +---------------------+

Finding duplicate IDs

The duplicates and isid commands help identify and deal with duplicate observations.
FAQs
What's new?
Statistics
Data management
Graphics
Programming Stata
Mata
Resources
Internet capabilities
Stata for Windows
Stata for Unix
Stata for Mac
Technical support
Resources & support
FAQs
Technical support
NetCourses
Short courses
Users Group meetings
Statalist
Links
Software updates
Software archives
Customer service
Manuals & supplements
Stata Journal
STB
Stata News
Stata Automation
Plugins

Site overview
Products
Resources & support
Company
Site index

© Copyright 1996–2009 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index