Re: st: merging datasets
From: Will Hauser <[email protected]>
To: [email protected]
Subject: Re: st: merging datasets
Date: Wed, 10 Nov 2010 22:38:05 -0500
Mike,
I have worked extensively with matching string and numeric variables
across several data sets. It is a pain. Michael Mitchell has given you
good advice; I would build on it by noting the following (sorry, this
will be lengthy).
To summarize up front: do NOT use the update option, watch for
duplicates, and remove matched cases from both files after each pass.
1. With large data sets such as yours (and mine) the joinby command
creates a mess. I don't trust myself to sort it out with logical
commands. The best approach is to first make the matches that are sure
things (i.e. matches on all variables, or on those that really matter:
first name, last name, and birthday). Hopefully the number of cases left
over after those 'perfect' passes is manageable, so that you can do some
hand checking with the help of dummy variables that identify which
pieces of information do not match.
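For illustration, a "sure thing" first pass might look like the sketch
below. I'm using the merge syntax of Stata 11 and later, and the file
and variable names are placeholders:

    use master_data, clear
    merge 1:1 firstname lastname birthday using using_data
    tab _merge    // _merge==3 are the sure things; _merge==1 cases go on to later passes

(One nice side effect of -merge 1:1- is that it stops with an error if
the match variables do not uniquely identify cases in both files, which
is exactly the kind of problem the next point is about.)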
2. You should know that the merge command, at least in Stata 10, will
duplicate entries in the master file if there are multiple observations
in the using data set that match a case in the master according to the
match criteria you set. Stata does this *even though* the master file
is "held inviolate." So, if you match based on last name only, you'll
wind up with several duplicate entries for "Smith", where each Smith in
the master is matched to every "Smith" in the using data set.
You'll want a unique identifier for each case in the master so you can
see which cases end up duplicated (duplicates list idvar). Also do a
count before the merge and make sure you end up back at that number at
the end of the process.
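A sketch of that bookkeeping (idvar, the file names, and the match
variables are placeholders of mine):

    use master_data, clear
    gen long idvar = _n                // unique identifier for each master case
    count
    local n_master = r(N)              // pre-merge count of master cases
    merge 1:1 firstname lastname birthday using using_data
    duplicates report idvar            // did any master case get duplicated?
    duplicates list idvar if _merge==3
    count if _merge != 2               // master cases after the merge
    assert r(N) == `n_master'          // should equal the pre-merge count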
3. Do the matching process in multiple 'passes'. The first pass uses
the most match variables and the last pass uses the fewest (a sketch of
one pass follows at the end of this point). Choose the match
variables carefully so that earlier passes are more trustworthy than
later passes (e.g. the pass based on last name and birthday should
probably come before the pass based on first initial and last name).
When you have a match, remove it from the master (and in your case
probably the using file as well). This is because subsequent matches
will use less info and there is no reason to think that they will be a
better match than a match from an earlier pass where more of the
variables matched. When you are done with each merge just save the file
to a temporary file name, keep only the matches, and then save the file
as "first pass matches" or some equivalent. Then re-open the temp file,
keep only the unmatched (i.e. _merge==1) and save as the "master data"
or some equivalent. When you are done just append all the saved matches
data sets together.
*To keep things orderly I like to include the word "master" in the
filename for the master dataset and "using" for the filename for the
using dataset. The syntax can become cumbersome and it's nice to
quickly see that I'm using the right datasets in the right places.
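Here is a rough sketch of one full pass, again with placeholder file
names and Stata 11+ merge syntax:

    use master_data, clear
    merge 1:1 firstname lastname birthday using using_data
    tempfile pass1
    save `pass1'
    keep if _merge==3
    save pass1_matches, replace        // the matches from this pass
    use `pass1', clear
    keep if _merge==1                  // unmatched master cases only
    drop _merge
    save master_for_pass2, replace     // becomes the master for the next pass
    * ... repeat with fewer / weaker match variables, then at the very end:
    use pass1_matches, clear
    append using pass2_matches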
4. Clean your string variables carefully and thoroughly. Look out for
individuals with hyphenated names, internal capitalization (e.g. McCory),
or internal spaces (e.g. St Joseph). Be sure suffixes aren't contaminating
the last name field. Look out for extra spaces (leading, trailing, or
internal) and make the capitalization uniform (all caps or all lowercase).
Use the string functions, in particular regexm, trim, and itrim, and the
user-written command 'strip'. You can't clean your string variables enough!
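A few typical cleaning lines as a sketch (lastname is a placeholder;
adjust the patterns to whatever you actually find in your data):

    replace lastname = upper(itrim(trim(lastname)))      // uniform case, no stray spaces
    replace lastname = subinstr(lastname, ".", "", .)     // drop periods
    replace lastname = subinstr(lastname, "-", " ", .)    // treat hyphens as spaces
    gen byte suffix_flag = regexm(lastname, " (JR|SR|II|III|IV)$")
    list lastname if suffix_flag                           // inspect possible suffixes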
5. In my experience first names tend to vary - one database may have
nicknames while the other may have formal names. Resolve this with a
variable for first initial. First match using the full first name along
with the other variables then the next pass should use the first initial
instead along with the other variables. You may want to do this with
birth dates as well - first use the actual date and then just use the
year (or month and year). Obviously, make sure you convert dates from
string format into Stata's numeric date format.
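For example (the first-name and birth-date variable names are
placeholders, and the "MDY" mask assumes dates stored like 11/10/1970):

    gen str1 firstinit = substr(firstname, 1, 1)    // first initial for the looser passes
    gen bdate = date(birthdate_str, "MDY")           // string date -> Stata date value
    format bdate %td
    gen birthyear = year(bdate)                      // for a year-only pass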
6. Name and label your _merge variable from each pass so you know where
the match came from, how trustworthy it is, which variables it was
matched on, etc. These are extremely handy when hand checking or writing
logical statements to drop bad matches.
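For instance (the name and label here are just illustrations):

    rename _merge merge_pass1
    label variable merge_pass1 "pass 1: matched on first name, last name, and birthday"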
7. With a large data set the merge variables probably won't uniquely
identify the cases. This same problem applies to the sort command. For
this reason when you sort your data sets use the 'stable' option! If you
are like me and like to be able to go back and re-run the do file to
make changes or corrections in light of errors you find later on, you may
have a real mess on your hands if you don't use the stable option (I
found out the hard way). Have a unique sort order at the start (e.g.
sort by all the variables) and then always specify the stable option
with subsequent sorts. I use numeric ids to refer to each case (e.g.
when dropping mismatches), and because the order of the dataset was not
constant when I created those ids after the match process, the ids
changed every time I re-ran the do file.
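A minimal sketch of what I mean:

    sort _all                  // a deterministic starting order: sort on every variable
    gen long caseid = _n       // ids are now reproducible across runs of the do file
    sort lastname, stable      // and specify the stable option on all later sorts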
8. Keep track of duplicates in the master dataset. You can do this with
egen's 'group' function: generate a group for each combination of the
variables that you match on, and tag any group containing more than one
observation as duplicated on that varlist (your match variables) using a
dummy variable.
This is because some matches are not necessarily bad matches but they
may be uncertain matches. In short, they may match based on the
variables you have data for but, because there are duplicates on those
variables, you can't say for sure if the match is correct or not. For
example if you have 2 John Smiths who are both missing a birthday and
you have 2 John Smiths in the using data set who each have a birthday
then a merge will link these up (and generate 2 duplicates) but in the
end you have to throw these matches out because you don't know which
case in the master goes with which case in the using. Having a variable
that identifies cases that have duplicates based on first and last name
will help you write a logical command to drop these uncertain matches.
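A sketch of the flagging, using egen's group() function and, as an
alternative, duplicates tag (the variable names are placeholders):

    egen matchgrp = group(firstname lastname)
    bysort matchgrp: gen byte dup_on_name = _N > 1     // 1 if the name combination is not unique
    * duplicates tag does much the same, giving the number of surplus copies instead:
    duplicates tag firstname lastname, gen(n_dups)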
9. Last and finally, do NOT use the update option! Let's say you have
a Franklin Smith in the master file who is missing his birthday. Now
you match based on last name and first initial and get a match for F
Smith in the using data set (he has a birthday listed). The data in the
master file is not overwritten, so only the missing data is carried over
into the master and your Franklin Smith now has a birthday. Sounds
great, but what you cannot see is that Franklin "F Smith" matched to
Frederick "F Smith" in the using file - almost certainly not the same
person, but you cannot discern this from the master dataset. For
this reason **the update option is a baaaad idea unless you are 100%
confident the match is, and can only be, correct.**
When a variable in the master has missing data you can get that data
into the master from the using by opening the using data set before the
merge, and duplicating that variable with a slightly different name
(e.g. varname_using). Then, once you've eliminated bad matches you can
just copy that data from varname_using to varname if varname=="".
***Because the merge will bring varname_using into the master file, make
sure you drop it for the *unmatched cases* before the next pass (you
can do this when you open the master to sort it before the next pass).
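A sketch of that workaround, with placeholder names and Stata 11+ merge
syntax (substitute the match variables for your own pass):

    use using_data, clear
    gen bday_using = birthday             // copy the variable under a different name
    save using_data_plus, replace
    use master_data, clear
    merge 1:1 firstinit lastname using using_data_plus
    * ... drop the bad and uncertain matches first ...
    replace birthday = bday_using if missing(birthday) & _merge==3
    drop bday_using                       // once birthday is filled in, drop the helper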
One more tip: depending on the complexity of the task, the directory
structure, and your cleverness you may be able to use a foreach loop to
do the merges.
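For what it's worth, here is a sketch of such a loop, with each pass's
match variables held in a local macro. All the names are placeholders,
and note that -merge 1:1- will stop with an error if a key is not unique
in either file, which is itself a useful warning:

    use master_data, clear
    save master_working, replace
    local pass1 "firstname lastname birthday"
    local pass2 "firstinit lastname birthday"
    local pass3 "firstinit lastname"
    foreach p in pass1 pass2 pass3 {
        use master_working, clear
        merge 1:1 ``p'' using using_data
        preserve
        keep if _merge==3
        save matches_`p', replace       // e.g. matches_pass1.dta
        restore
        keep if _merge==1
        drop _merge
        save master_working, replace    // the shrinking master for the next pass
    }

(In your case you would also want to drop the matched cases from the
using file at each step, as described in point 3.)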
If you have additional questions about anything I've written here, please
feel free to contact me personally. I'm not a Stata guru, but I have used
it extensively for precisely the sort of task that you are undertaking,
with some very large data sets (e.g. voter rolls). If you'd like to
see some of the do files I've used for this purpose, I'd be happy to
share them. Caveat emptor.
Will
On 11/8/2010 11:50 PM, Michael Eisenberg wrote:
Colleagues,
I have a database of about 20K men that I'd like to merge with another
database. I have names (first, middle, and last) as well as date of
birth and social security number for most men. Unfortunately, the
original database has some missing data on birthdate and social
security numbers. The new database has most of the birthdate info as
well as the geographic information that I need.
Some men do have the same name.
Is there any way to merge based on name if it doesn't uniquely identify
men? I'd like to somehow match all men and then manually
compare based on visit dates to decide whether the match is likely
correct. If not, any suggestions?
Thanks for your help.
Mike
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*