Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: merge creates duplicates in master data
From 
 
Michael Norman Mitchell <[email protected]> 
To 
 
[email protected] 
Subject 
 
Re: st: merge creates duplicates in master data 
Date 
 
Mon, 26 Apr 2010 09:49:47 -0700 
Dear William
  I have approached these kinds of problems in the past, but have 
approached them in a different way with quite a bit of success. Please 
take this for what it is worth, just a brainstorming idea or an idea for 
a future approach. You may see it useful in your case, maybe not.
  Consider the two datasets, A and B that have the kind of information 
that you are describing. They may match perfectly, they may match to 
varying degrees of imperfect matches. I would set up a series of match 
criteria, for example
  1. first name, last name, middle initial, region
Matches at this level would be consider a "quality 1" match. If a 
quality 1 match was not found, I would take the *unmatched observations* 
from each dataset, and submit them to a second match criteria, for example
  2. first name, last name, region
Matches at this level would be considered a "quality 2" match. If a 
quality 2 match was not found, I would take the *unmatched observations* 
(neither matched at quality 1 or quality 2) and then try a third round, 
for example
  3. first initial, last name, region
Matches at this level would be considerd a "quality 3" match. If this 
was the final match criteria, then I would consider the remaining 
unmatched to be "not found" and would manually inspect them looking for 
other ways that they could be matched. I would then append the matched 
records from "round 1", "round 2" and "round 3" and those would form the 
matched records.
  I don't know if this strategy is exactly helpful in your case. If 
not, I hope it is something that you (or other Statalisters) may find 
useful in the future. In fact, I think I will put this on my list of "to 
do" items for an upcoming Stata tidbit of the week.
Best luck and best regards,
Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com
On 2010-04-25 7.42 PM, Will Hauser wrote:
Hello all,
I am experiencing unexpected behavior in Stata 10 when using the merge 
command.
I am matching two lists based on a series of string variables (first 
name, last name, initials) and one numeric region identifier.  I have 
carefully cleaned the string variables of excess spaces and 
punctuation marks but they are inherently difficult to match as the 
name on one list may correspond to a nick name or abbreviation on the 
other (e.g. "WILLIAM" may correspond with "W" or "BILL").  My approach 
to this problem is to make multiple merges between the two lists each 
time using less information.  For example, the first merge uses first 
name, last name, and region.  The second uses first initial, last 
name, and region.  The third just last name and region (and so on).  
Since the master data is inviolate subsequent mismatches should never 
overwrite earlier 'good' matches.  I am using the update option but 
not the replace option.  I am not using the unique option since the 
variables do not uniquely identify the cases in either the master or 
the using.
From what I can tell Stata is duplicating cases in the master 
dataset.  The end result is 10 pairs of duplicate entries that appear 
identical in every way save for the _merge summary variable from the 
last merge.  The summary variable indicates using agrees with master 
(3) for one of the duplicates and indicates that using does not agree 
with master for the other (5).  There are no missing values in either 
list and I can see nothing special about the entries that are 
duplicated.  I have used the duplicates command to verify that these 
duplicates are not present in the master data prior to merging.
I assume this is not a bug but is rather something about the merge 
command I am misunderstanding and that concerns me very much.  I would 
be happy to provide the lists and the relevant portion of the do file 
if anyone is interested.  The lists are public and are not unusually 
long (958 cases in the master and 593 cases in the using).
Thanks for your insight,
William Hauser
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/