BPM <[email protected]> is having difficulty merging datasets, and
writes,
> I need to match merge two datasets, one data is a subset of the other
> data at the individual level and contains with extra variable fields.
> The data read into STATA without issue, including attached labels.
> [...]
> [...] I do not get an error message. I just get _merge = 1 or 2 and
> not _merge = 3.
>
> The variables to match include: lastname, firstname, ssn, incidentdate
Since you do not get an error message, and since _merge == 1 or 2,
there is one thing you can be sure of, and that is that there is not
one occurance of (lastname, firstname, ssn, incedentdate) in one
dataset that appears in the other. What you need to ask yourself is
how that can be given that you think there are occurances that are in
both.
Let's start with lastname and firstname. Are they string variables?
They should be. I ask because you mention "including attached labels"
and I wonder if the two name fields are perhaps being run through
value labels. That could cause problems. If they are being run through
value labels, do the following in each dataset,
. decode lastname, gen(lname)
. decode firstname, gen(fname)
. drop lastname firstname
. rename lname lastname
. rename fname firstname
Okay, now they are strings. Now let's worry about leading or trailing
blanks. Do the following:
. replace lastname = trim(lastname)
. replace firstname = trim(firstname)
Also check capitalization. Are they capitalized the same way between the
two datasets? If not, let's just convert to lower case:
. replace lastname = lower(lastname)
. replace firstname = lower(firstname)
Now what about variable ssn? Is it string, numeric, or value-labeled numeric?
It should either be a string or a numeric without a value label, and if
the latter, make sure it is a double, since only a double has sufficient
number of digits to hold a US SSN without rounding. If it is numeric and
not a double, you will have to go back and read the original data again.
There is no undoing rounding errors after the fact.
If variable ssn is a string, then
. replace ssn = trim(ssn)
just to make sure there are no leading or trailing blanks.
Finally, we are on incidentdate. Is that a string variable? Fine if it
is (but get rid of leading and trialing blanks), and realize that
string variables only match if they are literally the same. "March 3, 2008"
is not the same as "March 03, 2003", or "3/3/2003", etc. If the formats
are different, you will have to make incidenetdate into a Stata date before
you will be able to match them.
I'm hoping something I just said will help, but I worry it will not because
I'm just making shots in the dark. So let's assume that nothing above
solves the problem. In that case, let's start simplifying the problem
until it becomes obvious to us just exactly what is wrong. With that
knowledge, we can go back, fix the problem, and fix the two datasets.
Let me assume the original two datasets are called super.dta and sub.dta.
Do the following:
. use super
. keep lastname
. sort lastname
. by lastname: keep if _n==1
. save one
. use sub
. keep lastname
. sort lastname
. by lastname: keep if _n==1
. save two
. use one
. merge lastname using two
. tabulate _merge
The point of this experiment is to find out whether lastname ever merges.
Certainly some lastnames are the same between the two datasets. What I
did was keep each name (but only once) in each dataset, and then merge.
There are two possiblities from this experiment: Either lastname did merge,
and so we can dismiss variable lastname as the cause of the problem, or
lastname did not merge. If it did not merge, go digging in one.dta and
two.dta, find an example that you think should have merged, and then figure
out why they did not.
Now do the above experiment substituting firstname everywhere lastname
appears. Same questions, same procedure.
Now do the above experiment substituting ssn, and then do it again
substituting incidencedate.
By now, I suspect you will have found the problem. But let's assume
you have not. This time, let's repeat the experiment, but with
firstname and lastname:
. use super
. keep lastname firstname
. sort lastname firstname
. by lastname firstname: keep if _n==1
. save one
. use sub
. keep lastname firstname
. sort lastname firstname
. by lastname firstname: keep if _n==1
. save two
. use one
. merge lastname firstname using two
. tabulate _merge
Did that work? No? Figure out why. Yes? Do it again, but with three
variables, namely lastname, firstname, and ssn. And then with all four.
Eventually, it will become obvious what the problem is.
Good luck.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/