Re: st: merging datasets
From: Will Hauser <[email protected]>
To: [email protected]
Subject: Re: st: merging datasets
Date: Wed, 10 Nov 2010 22:38:05 -0500
Mike,
I have worked extensively with matching string and numeric variables
across several data sets. It is a pain. Michael Mitchell has given you
good advice; I would build on it by noting the following (sorry, this
will be lengthy).
To summarize up front: do NOT use the update option, watch for
duplicates, and remove matched cases from both files after each pass.
1. With large data sets such as yours (and mine) the joinby command
creates a mess. I don't trust myself to sort it out with logical
commands. The best approach is to first make the matches that are sure
things (i.e. matches on all variables, or on those that really matter:
first name, last name, and birthday). Hopefully the number of cases left
over after those 'perfect' passes is manageable, so that you can do some
hand checking with the help of dummy variables that identify which
pieces of information do not match.
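For illustration, a "sure thing" first pass might look like the sketch
below. I'm using the merge syntax of Stata 11 and later, and the file
and variable names are placeholders:

    use master_data, clear
    merge 1:1 firstname lastname birthday using using_data
    tab _merge    // _merge==3 are the sure things; _merge==1 cases go on to later passes

(One nice side effect of -merge 1:1- is that it stops with an error if
the match variables do not uniquely identify cases in both files, which
is exactly the kind of problem the next point is about.)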
2. You should know that the merge command, at least in Stata 10, will
duplicate entries in the master file if there are multiple observations
in the using data set that match a case in the master according to the
match criteria you set. Stata does this *even though* the master file
is "held inviolate." So, if you match based on last name only, you'll
wind up with several duplicate entries for "Smith", where each Smith in
the master is matched to every "Smith" in the using data set.
You'll want a unique identifier for each case in the master so you can
see which cases end up duplicated (duplicates list idvar). Also do a
count before the merge and make sure you end up back at that number at
the end of the process.
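A sketch of that bookkeeping (idvar, the file names, and the match
variables are placeholders of mine):

    use master_data, clear
    gen long idvar = _n                // unique identifier for each master case
    count
    local n_master = r(N)              // pre-merge count of master cases
    merge 1:1 firstname lastname birthday using using_data
    duplicates report idvar            // did any master case get duplicated?
    duplicates list idvar if _merge==3
    count if _merge != 2               // master cases after the merge
    assert r(N) == `n_master'          // should equal the pre-merge count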
3. Do the matching process in multiple 'passes'. The first pass uses
the most match variables and the last pass uses the fewest (a sketch of
one pass follows at the end of this point). Choose the match
variables carefully so that earlier passes are more trustworthy than
later passes (e.g. the pass based on last name and birthday should
probably come before the pass based on first initial and last name).
When you have a match, remove it from the master (and in your case
probably the using file as well). This is because subsequent matches
will use less info and there is no reason to think that they will be a
better match than a match from an earlier pass where more of the
variables matched. When you are done with each merge just save the file
to a temporary file name, keep only the matches, and then save the file
as "first pass matches" or some equivalent. Then re-open the temp file,
keep only the unmatched (i.e. _merge==1) and save as the "master data"
or some equivalent. When you are done just append all the saved matches
data sets together.
*To keep things orderly I like to include the word "master" in the
filename for the master dataset and "using" for the filename for the
using dataset. The syntax can become cumbersome and it's nice to
quickly see that I'm using the right datasets in the right places.
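Here is a rough sketch of one full pass, again with placeholder file
names and Stata 11+ merge syntax:

    use master_data, clear
    merge 1:1 firstname lastname birthday using using_data
    tempfile pass1
    save `pass1'
    keep if _merge==3
    save pass1_matches, replace        // the matches from this pass
    use `pass1', clear
    keep if _merge==1                  // unmatched master cases only
    drop _merge
    save master_for_pass2, replace     // becomes the master for the next pass
    * ... repeat with fewer / weaker match variables, then at the very end:
    use pass1_matches, clear
    append using pass2_matches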
4. Clean your string variables carefully and thoroughly. Look out for
individuals with hyphenated names, internal capitalization (e.g. McCory),
or internal spaces (e.g. St Joseph). Be sure suffixes aren't contaminating
the last name field. Look out for extra spaces (leading, trailing, or
internal) and make the capitalization uniform (all caps or all lowercase).
Use the string functions, in particular regexm, trim, and itrim, and the
user-written command 'strip'. You can't clean your string variables enough!
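A few typical cleaning lines as a sketch (lastname is a placeholder;
adjust the patterns to whatever you actually find in your data):

    replace lastname = upper(itrim(trim(lastname)))      // uniform case, no stray spaces
    replace lastname = subinstr(lastname, ".", "", .)     // drop periods
    replace lastname = subinstr(lastname, "-", " ", .)    // treat hyphens as spaces
    gen byte suffix_flag = regexm(lastname, " (JR|SR|II|III|IV)$")
    list lastname if suffix_flag                           // inspect possible suffixes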
5. In my experience first names tend to vary - one database may have
nicknames while the other may have formal names. Resolve this with a
variable for first initial. First match using the full first name along
with the other variables then the next pass should use the first initial
instead along with the other variables. You may want to do this with
birth dates as well - first use the actual date and then just use the
year (or month and year). Obviously, make sure you convert dates from
string format into Stata's numeric date format.
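For example (the first-name and birth-date variable names are
placeholders, and the "MDY" mask assumes dates stored like 11/10/1970):

    gen str1 firstinit = substr(firstname, 1, 1)    // first initial for the looser passes
    gen bdate = date(birthdate_str, "MDY")           // string date -> Stata date value
    format bdate %td
    gen birthyear = year(bdate)                      // for a year-only pass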
6. Name and label your _merge variable from each pass so you know where
the match came from, how trustworthy it is, which variables it was
matched on, etc. These are extremely handy when hand checking or writing
logical statements to drop bad matches.
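For instance (the name and label here are just illustrations):

    rename _merge merge_pass1
    label variable merge_pass1 "pass 1: matched on first name, last name, and birthday"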
7. With a large data set the merge variables probably won't uniquely
identify the cases. This same problem applies to the sort command. For
this reason when you sort your data sets use the 'stable' option! If you
are like me and like to be able to go back and re-run the do file to
make changes or corrections in light of errors you find later on, you may
have a real mess on your hands if you don't use the stable option (I
found out the hard way). Have a unique sort order at the start (e.g.
sort by all the variables) and then always specify the stable option
with subsequent sorts. I use numeric ids to refer to each case (e.g.
when dropping mismatches), and because the order of the dataset was not
constant when I created those ids after the match process, the ids
changed every time I re-ran the do file.
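A minimal sketch of what I mean:

    sort _all                  // a deterministic starting order: sort on every variable
    gen long caseid = _n       // ids are now reproducible across runs of the do file
    sort lastname, stable      // and specify the stable option on all later sorts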
8. Keep track of duplicates in the master dataset. You can do this with
egen's 'group' function: generate a group for each combination of the
variables that you match on, and tag any group containing more than one
observation as duplicated on that varlist (your match variables) using a
dummy variable.
This is because some matches are not necessarily bad matches but they
may be uncertain matches. In short, they may match based on the
variables you have data for but, because there are duplicates on those
variables, you can't say for sure if the match is correct or not. For
example if you have 2 John Smiths who are both missing a birthday and
you have 2 John Smiths in the using data set who each have a birthday
then a merge will link these up (and generate 2 duplicates) but in the
end you have to throw these matches out because you don't know which
case in the master goes with which case in the using. Having a variable
that identifies cases that have duplicates based on first and last name
will help you write a logical command to drop these uncertain matches.
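A sketch of the flagging, using egen's group() function and, as an
alternative, duplicates tag (the variable names are placeholders):

    egen matchgrp = group(firstname lastname)
    bysort matchgrp: gen byte dup_on_name = _N > 1     // 1 if the name combination is not unique
    * duplicates tag does much the same, giving the number of surplus copies instead:
    duplicates tag firstname lastname, gen(n_dups)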
9. Last and finally, do NOT use the update option! Let's say you have
a Franklin Smith in the master file who is missing his birthday. Now
you match based on last name and first initial and get a match for F
Smith in the using data set (he has a birthday listed). The data in the
master file is not overwritten, so only the missing data is carried over
into the master and your Franklin Smith now has a birthday. Sounds
great, but what you cannot see is that Franklin "F Smith" matched to
Frederick "F Smith" in the using file - almost certainly not the same
person, but you cannot discern this from the master dataset. For
this reason **the update option is a baaaad idea unless you are 100%
confident the match is, and can only be, correct.**
When a variable in the master has missing data you can get that data
into the master from the using by opening the using data set before the
merge, and duplicating that variable with a slightly different name
(e.g. varname_using). Then, once you've eliminated bad matches you can
just copy that data from varname_using to varname if varname=="".
***Because the merge will bring varname_using into the master file, make
sure you drop it for the *unmatched cases* before the next pass (you
can do this when you open the master to sort it before the next pass).
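A sketch of that workaround, with placeholder names and Stata 11+ merge
syntax (substitute the match variables for your own pass):

    use using_data, clear
    gen bday_using = birthday             // copy the variable under a different name
    save using_data_plus, replace
    use master_data, clear
    merge 1:1 firstinit lastname using using_data_plus
    * ... drop the bad and uncertain matches first ...
    replace birthday = bday_using if missing(birthday) & _merge==3
    drop bday_using                       // once birthday is filled in, drop the helper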
One more tip: depending on the complexity of the task, the directory
structure, and your cleverness you may be able to use a foreach loop to
do the merges.
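For what it's worth, here is a sketch of such a loop, with each pass's
match variables held in a local macro. All the names are placeholders,
and note that -merge 1:1- will stop with an error if a key is not unique
in either file, which is itself a useful warning:

    use master_data, clear
    save master_working, replace
    local pass1 "firstname lastname birthday"
    local pass2 "firstinit lastname birthday"
    local pass3 "firstinit lastname"
    foreach p in pass1 pass2 pass3 {
        use master_working, clear
        merge 1:1 ``p'' using using_data
        preserve
        keep if _merge==3
        save matches_`p', replace       // e.g. matches_pass1.dta
        restore
        keep if _merge==1
        drop _merge
        save master_working, replace    // the shrinking master for the next pass
    }

(In your case you would also want to drop the matched cases from the
using file at each step, as described in point 3.)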
If you have additional questions about anything I've written here, please
feel free to contact me personally. I'm not a Stata guru, but I have used
it extensively for precisely the sort of task that you are undertaking,
with some very large data sets (e.g. voter rolls). If you'd like to
see some of the do files I've used for this purpose, I'd be happy to
share them. Caveat emptor.
Will
On 11/8/2010 11:50 PM, Michael Eisenberg wrote:
Colleagues,
I have a database of about 20K men that I'd like to merge with another
database. I have names (first, middle, and last) as well as date of
birth and social security number for most men. Unfortunately, the
original database has some missing data on birthdate and social
security numbers. The new database has most of the birthdate info as
well as the geographic information that I need.
Some men do have the same name.
Is there any way to merge based on name if it doesn't uniquely identify
men? I'd like to somehow match all men and then manually
compare based on visit dates to decide whether the match is likely
correct. If not, any suggestions?
Thanks for your help.
Mike
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*