st: merge creates duplicates in master data
From: Will Hauser <[email protected]>
To: [email protected]
Date: Sun, 25 Apr 2010 22:42:35 -0400
Hello all,
I am experiencing unexpected behavior in Stata 10 when using the merge
command.
I am matching two lists based on a series of string variables (first
name, last name, initials) and one numeric region identifier. I have
carefully cleaned the string variables of excess spaces and punctuation
marks, but they are inherently difficult to match because the name on one
list may correspond to a nickname or abbreviation on the other (e.g.
"WILLIAM" may correspond with "W" or "BILL"). My approach to this
problem is to make multiple merges between the two lists, each time using
less information. For example, the first merge uses first name, last
name, and region. The second uses first initial, last name, and
region. The third uses just last name and region (and so on). Since the
master data is inviolate, subsequent mismatches should never overwrite
earlier 'good' matches. I am using the update option but not the
replace option. I am not using the unique option since the variables do
not uniquely identify the cases in either the master or the using.
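
To make the tiered approach above concrete, here is a minimal sketch of the
same idea outside Stata, in Python. It is not the merge command and does not
reproduce its behavior. The field names (first, initial, last, region) are
hypothetical, since the actual variable names are not given here, and the
rule of skipping any key that matches more than one using record is an
assumption of the sketch.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]

# Hypothetical key sets, from strictest to loosest (names are assumptions).
PASSES: List[Tuple[str, ...]] = [
    ("first", "last", "region"),
    ("initial", "last", "region"),
    ("last", "region"),
]


def key_of(rec: Record, fields: Tuple[str, ...]) -> tuple:
    """Build the match key for one record from the given fields."""
    return tuple(rec[f] for f in fields)


def tiered_match(
    master: List[Record], using: List[Record]
) -> List[Tuple[Optional[int], Optional[int]]]:
    """For each master record, return (index in using, pass number),
    or (None, None) if no pass produced an unambiguous match."""
    result: List[Tuple[Optional[int], Optional[int]]] = [(None, None)] * len(master)
    for pass_no, fields in enumerate(PASSES, start=1):
        # Index the using list by the current key.
        index: Dict[tuple, List[int]] = {}
        for j, rec in enumerate(using):
            index.setdefault(key_of(rec, fields), []).append(j)
        for i, rec in enumerate(master):
            if result[i][0] is not None:
                continue  # keep the earlier, stricter match
            hits = index.get(key_of(rec, fields), [])
            if len(hits) == 1:  # assumption: ambiguous keys are skipped
                result[i] = (hits[0], pass_no)
    return result
```

Because the result holds exactly one entry per master record, this sketch
cannot add rows to the master list, and a later pass never overwrites a match
found by an earlier one.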
From what I can tell, Stata is duplicating cases in the master dataset.
The end result is 10 pairs of duplicate entries that appear identical in
every way save for the _merge summary variable from the last merge. The
summary variable indicates using agrees with master (3) for one of the
duplicates and indicates that using does not agree with master for the
other (5). There are no missing values in either list and I can see
nothing special about the entries that are duplicated. I have used the
duplicates command to verify that these duplicates are not present in
the master data prior to merging.
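
One check that may help narrow this down is to count how many records share
each merge key in both lists, since the keys are known not to be unique. The
sketch below, in Python with hypothetical field names, lists the keys that
occur more often in the using list than in the master list. Whether such
keys are what produces the extra rows is an assumption here, not a confirmed
diagnosis.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Dict[str, object]


def key_counts(records: List[Record], fields: Tuple[str, ...]) -> Counter:
    """Count how many records share each key built from `fields`."""
    return Counter(tuple(rec[f] for f in fields) for rec in records)


def keys_with_more_using_rows(
    master: List[Record], using: List[Record], fields: Tuple[str, ...]
) -> Dict[tuple, Tuple[int, int]]:
    """Map each master key to (master count, using count) when the key
    occurs more often in the using list than in the master list."""
    m = key_counts(master, fields)
    u = key_counts(using, fields)
    # Counter returns 0 for keys that are absent from the using list.
    return {k: (m[k], u[k]) for k in m if u[k] > m[k]}
```

Running this once per merge pass, with that pass's key fields, would show
whether the 10 duplicated entries line up with keys of this kind.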
I assume this is not a bug but rather something about the merge
command that I am misunderstanding, and that concerns me very much. I would
be happy to provide the lists and the relevant portion of the do file if
anyone is interested. The lists are public and are not unusually long
(958 cases in the master and 593 cases in the using).
Thanks for your insight,
William Hauser