Re: st: merge creates duplicates in master data

From	Michael Norman Mitchell <[email protected]>
To	[email protected]
Subject	Re: st: merge creates duplicates in master data
Date	Mon, 26 Apr 2010 22:33:21 -0700

Dear Will

I am delighted that this was able to help you out... I nearly did not write this because I was afraid that I might be sending you on a wild goose chase. I am glad that this was helpful.


  I think there are two other tidbits I would add.

1. You might want to play with the -soundex()- function with respect to the first and last names. This can be helpful in catching misspellings and soundalike names; see http://www.stata.com/help.cgi?string+functions .

2. When I did this, I played with different matching criteria and tried different orders of the matching criteria. I tried to find the steps that led to the greatest number of high quality matches. But, in the case where I was doing this, I had about 10 different potential variables. It sounds like your set of matching variables is smaller.
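To make point 1 concrete, here is a minimal sketch of the classic Soundex coding, written in Python purely for illustration. It is not Stata's -soundex()- implementation, and its output may not agree with Stata's in every edge case. The function name and the example names are hypothetical.

```python
# Classic Soundex: keep the first letter, code the following consonants as
# digits, collapse repeated codes, and pad or truncate to four characters.
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a four-character Soundex code (illustrative, not Stata's)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:      # skip a repeat of the previous code
                result += code
            prev = code
        elif ch not in "HW":      # vowels reset the previous code; H and W do not
            prev = ""
    return (result + "000")[:4]


print(soundex("HAUSER"), soundex("HOUSER"))   # spelling variants share a code
print(soundex("WILLIAM"), soundex("BILL"))    # nicknames do not
```

A code like this groups spelling variants of the same name, but it does not link a nickname to its formal name, so the "WILLIAM" versus "BILL" cases would still need a separate rule or a manual check.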

Lastly, these kinds of matching problems are time consuming and tricky, so it is to be expected that it can take some time to work it through.


Best luck!

Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com

On 2010-04-26 12.50 PM, Will Hauser wrote:

Michael,
My thinking was that I was doing exactly what you were suggesting since the master data is "inviolate." I assumed that once a match had occurred it would not be overwritten and thus subsequent merges would only apply to the unmatched cases. *However, I now see this is incorrect.* The merge command does not alter the master data (update and replace options aside) but it will duplicate cases as necessary to make all possible matches between the using and master datasets. This is the source of the confusion. The process would be simplified considerably if the merge command allowed an 'if' condition so I could match cases only *if* _prior_merge==1.
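The duplication described above can be sketched with a toy keyed join. This is written in Python for illustration only and is not how Stata's merge is implemented. It covers just the simplest case, one master row whose key appears more than once in the using list. The data, the function name join_on, and the _merge codes used here (1 for master only, 3 for matched) are assumptions made for the example.

```python
# Hypothetical toy data: one master row whose key occurs twice in the using list.
master = [
    {"last": "BROWN", "region": 1},
    {"last": "SMITH", "region": 2},
]
using = [
    {"last": "BROWN", "region": 1, "first": "WILLIAM"},
    {"last": "BROWN", "region": 1, "first": "BILL"},
]


def join_on(master, using, keys):
    """Pair every master row with every using row that shares its key."""
    index = {}
    for row in using:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    merged = []
    for row in master:
        hits = index.get(tuple(row[k] for k in keys), [])
        if not hits:
            merged.append({**row, "_merge": 1})           # master only
        for hit in hits:
            merged.append({**hit, **row, "_merge": 3})    # matched; master values win
    return merged


result = join_on(master, using, ["last", "region"])
for row in result:
    print(row)
```

The two-row master list comes back with three rows: the BROWN record appears once for each using row that shares its key, even though no master value was overwritten.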
Your suggestion did improve the procedure. After each match I saved the matches in a separate dataset and then re-merged the remaining cases. Then I just appended it all back together at the end. I write this as much to thank you as to clarify the situation for those who stumble upon this in the future. If you have any other suggestions for refining the process I'm all ears.
Thanks

William Hauser


Michael Norman Mitchell wrote:
Dear William
I have approached these kinds of problems in the past, but have approached them in a different way with quite a bit of success. Please take this for what it is worth, just a brainstorming idea or an idea for a future approach. You may find it useful in your case, maybe not.
Consider the two datasets, A and B, that have the kind of information that you are describing. They may match perfectly, or they may match to varying degrees of imperfection. I would set up a series of match criteria, for example
  1. first name, last name, middle initial, region
Matches at this level would be considered a "quality 1" match. If a quality 1 match was not found, I would take the *unmatched observations* from each dataset, and submit them to a second set of match criteria, for example
  2. first name, last name, region
Matches at this level would be considered a "quality 2" match. If a quality 2 match was not found, I would take the *unmatched observations* (matched at neither quality 1 nor quality 2) and then try a third round, for example
  3. first initial, last name, region
Matches at this level would be considered a "quality 3" match. If this was the final set of match criteria, then I would consider the remaining unmatched observations to be "not found" and would manually inspect them, looking for other ways that they could be matched. I would then append the matched records from "round 1", "round 2" and "round 3", and those would form the matched records.
I don't know if this strategy is exactly helpful in your case. If not, I hope it is something that you (or other Statalisters) may find useful in the future. In fact, I think I will put this on my list of "to do" items for an upcoming Stata tidbit of the week.
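The round-by-round procedure above can be sketched as follows. This is Python used only to show the logic, not Stata code and not the exact steps used in this thread. The records, field names (first, last, mi, region) and the function tiered_match are hypothetical. The sketch also assumes that when several B records share a key, the first unused one is taken, which is a simplification a real application would want to review by hand.

```python
def tiered_match(list_a, list_b, criteria):
    """Match A to B in rounds; only unmatched records go on to the next round."""
    matched = []
    left_a, left_b = list(list_a), list(list_b)
    for quality, key_fn in enumerate(criteria, start=1):
        index = {}
        for row in left_b:
            index.setdefault(key_fn(row), []).append(row)
        still_a, used = [], set()
        for row in left_a:
            free = [b for b in index.get(key_fn(row), []) if id(b) not in used]
            if free:
                used.add(id(free[0]))            # assumption: take the first candidate
                matched.append((row, free[0], quality))
            else:
                still_a.append(row)
        left_a = still_a
        left_b = [b for b in left_b if id(b) not in used]
    return matched, left_a, left_b


# Criteria ordered from strictest (quality 1) to loosest (quality 3).
criteria = [
    lambda r: (r["first"], r["last"], r["mi"], r["region"]),
    lambda r: (r["first"], r["last"], r["region"]),
    lambda r: (r["first"][:1], r["last"], r["region"]),
]

# Hypothetical records for illustration.
a = [
    {"first": "WILLIAM", "last": "BROWN", "mi": "J", "region": 1},
    {"first": "MARIA", "last": "LOPEZ", "mi": "N", "region": 2},
    {"first": "ANNA", "last": "SMITH", "mi": "", "region": 3},
    {"first": "JOHN", "last": "DOE", "mi": "", "region": 4},
]
b = [
    {"first": "WILLIAM", "last": "BROWN", "mi": "J", "region": 1},
    {"first": "MARIA", "last": "LOPEZ", "mi": "", "region": 2},
    {"first": "A", "last": "SMITH", "mi": "", "region": 3},
    {"first": "JANE", "last": "ROE", "mi": "", "region": 5},
]

matched, unmatched_a, unmatched_b = tiered_match(a, b, criteria)
for rec_a, rec_b, quality in matched:
    print(quality, rec_a["first"], rec_a["last"], "<->", rec_b["first"], rec_b["last"])
```

Because each round only sees records left over from the previous one, an earlier, stricter match is never revisited, and the quality number records which round produced each pair. Whatever remains in unmatched_a and unmatched_b is the "not found" set for manual inspection.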
Best luck and best regards,

Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com

On 2010-04-25 7.42 PM, Will Hauser wrote:
Hello all,
I am experiencing unexpected behavior in Stata 10 when using the merge command. I am matching two lists based on a series of string variables (first name, last name, initials) and one numeric region identifier. I have carefully cleaned the string variables of excess spaces and punctuation marks but they are inherently difficult to match as the name on one list may correspond to a nickname or abbreviation on the other (e.g. "WILLIAM" may correspond with "W" or "BILL"). My approach to this problem is to make multiple merges between the two lists, each time using less information. For example, the first merge uses first name, last name, and region. The second uses first initial, last name, and region. The third uses just last name and region (and so on). Since the master data is inviolate, subsequent mismatches should never overwrite earlier 'good' matches. I am using the update option but not the replace option. I am not using the unique option since the variables do not uniquely identify the cases in either the master or the using.
From what I can tell Stata is duplicating cases in the master dataset. The end result is 10 pairs of duplicate entries that appear identical in every way save for the _merge summary variable from the last merge. The summary variable indicates that using agrees with master (3) for one of the duplicates and that using does not agree with master (5) for the other. There are no missing values in either list and I can see nothing special about the entries that are duplicated. I have used the duplicates command to verify that these duplicates are not present in the master data prior to merging.
I assume this is not a bug but is rather something about the merge command I am misunderstanding, and that concerns me very much. I would be happy to provide the lists and the relevant portion of the do file if anyone is interested. The lists are public and are not unusually long (958 cases in the master and 593 cases in the using).
Thanks for your insight,

William Hauser
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/