Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: merge creates duplicates in master data
From
Michael Norman Mitchell <[email protected]>
To
[email protected]
Subject
Re: st: merge creates duplicates in master data
Date
Mon, 26 Apr 2010 22:33:21 -0700
Dear Will
I am delighted that this was able to help you out... I nearly did not
write this because I was afraid that I might be sending you on a wild
goose chase. I am glad that this was helpful.
I think there are two other tidbit I would add.
1. You might want to play with the -soundex()- function with respect
to the first and last names. This can be helpful in catching mispellings
and soundalike names, see http://www.stata.com/help.cgi?string+functions .
2. When I did this, I played with different matching criteria and
trying different orders of the matching criteria. I tried to find the
steps that led to the greatest number of high quality matches. But, in
the case where I was doing this, I had about 10 different potential
variables. It sounds like your set of matching variables is smaller.
Lastly, these kinds of matching problems are time consuming and
tricky, so it is to be expected that it can take some time to work it
through.
Best luck!
Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com
On 2010-04-26 12.50 PM, Will Hauser wrote:
Michael,
My thinking was that I was doing exactly what you were suggesting
since the master data is "inviolate." I assumed that once a match had
occurred it would not be overwritten and thus subsequent merges would
only apply to the unmatched cases. *However, I now see this is
incorrect.* The merge command does not alter the master data (update
and replace options aside) but it will duplicate cases as necessary to
make all possible matches between the using and master datasets. This
is the source of the confusion. The process would be simplified
considerably if the merge command allows an 'if' condition so I could
match cases only *if* _prior_merge==1.
Your suggestion did improve the procedure. After each match I saved
the matches in a separate dataset and then re-merged the remaining
cases. Then I just append it all back together at the end. I write
this as much to thank you as to clarify the situation for those who
stumble upon this in the future. If you have any other suggestions
for refining the process I'm all ears.
Thanks
William Hauser
Michael Norman Mitchell wrote:
Dear William
I have approached these kinds of problems in the past, but have
approached them in a different way with quite a bit of success.
Please take this for what it is worth, just a brainstorming idea or
an idea for a future approach. You may see it useful in your case,
maybe not.
Consider the two datasets, A and B that have the kind of
information that you are describing. They may match perfectly, they
may match to varying degrees of imperfect matches. I would set up a
series of match criteria, for example
1. first name, last name, middle initial, region
Matches at this level would be consider a "quality 1" match. If a
quality 1 match was not found, I would take the *unmatched
observations* from each dataset, and submit them to a second match
criteria, for example
2. first name, last name, region
Matches at this level would be considered a "quality 2" match. If a
quality 2 match was not found, I would take the *unmatched
observations* (neither matched at quality 1 or quality 2) and then
try a third round, for example
3. first initial, last name, region
Matches at this level would be considerd a "quality 3" match. If this
was the final match criteria, then I would consider the remaining
unmatched to be "not found" and would manually inspect them looking
for other ways that they could be matched. I would then append the
matched records from "round 1", "round 2" and "round 3" and those
would form the matched records.
I don't know if this strategy is exactly helpful in your case. If
not, I hope it is something that you (or other Statalisters) may find
useful in the future. In fact, I think I will put this on my list of
"to do" items for an upcoming Stata tidbit of the week.
Best luck and best regards,
Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com
On 2010-04-25 7.42 PM, Will Hauser wrote:
Hello all,
I am experiencing unexpected behavior in Stata 10 when using the
merge command.
I am matching two lists based on a series of string variables (first
name, last name, initials) and one numeric region identifier. I
have carefully cleaned the string variables of excess spaces and
punctuation marks but they are inherently difficult to match as the
name on one list may correspond to a nick name or abbreviation on
the other (e.g. "WILLIAM" may correspond with "W" or "BILL"). My
approach to this problem is to make multiple merges between the two
lists each time using less information. For example, the first
merge uses first name, last name, and region. The second uses first
initial, last name, and region. The third just last name and region
(and so on). Since the master data is inviolate subsequent
mismatches should never overwrite earlier 'good' matches. I am
using the update option but not the replace option. I am not using
the unique option since the variables do not uniquely identify the
cases in either the master or the using.
From what I can tell Stata is duplicating cases in the master
dataset. The end result is 10 pairs of duplicate entries that
appear identical in every way save for the _merge summary variable
from the last merge. The summary variable indicates using agrees
with master (3) for one of the duplicates and indicates that using
does not agree with master for the other (5). There are no missing
values in either list and I can see nothing special about the
entries that are duplicated. I have used the duplicates command to
verify that these duplicates are not present in the master data
prior to merging.
I assume this is not a bug but is rather something about the merge
command I am misunderstanding and that concerns me very much. I
would be happy to provide the lists and the relevant portion of the
do file if anyone is interested. The lists are public and are not
unusually long (958 cases in the master and 593 cases in the using).
Thanks for your insight,
William Hauser
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/