Brendan Miller <[email protected]> asked about how to do a
"fuzzy merge"
> [...] based on a string field that contains organization names.
> Unfortunately, the names are not listed equivalently in both databases (e.g.
> "The Miller Corporation" in one vs. "Miller Corp." in the other).
What Brendan wants is a "fuzzy/approximate string matching function" that will
do what he is thinking. I know of no such function and, even if it existed, I
would not recommend he trust it. Instead, I recommend Brendan do the match
himself, tailoring the rules to his particular problem.
The basic method
----------------
At each step of the way, we are going to have three datasets:
resid1.dta resid2.dta: matched.dta:
----------------- ----------------- ---------------
Unmatched records Unmatched records Matched
from 1st dataset from 2nd dataset records
----------------- ----------------- ----------------
At each step, we are going to look at resid1.dta and resid2.dta, spot
some matching rule (such as change "Corp" to Corporation" and then match),
and turn the crank: matching what we can from resid1 and resid2, adding
the matches to matched.dta, and updating resid1.dta and resid2.dta to contain
the new leftovers.
One of the advantages of this method is that, once we are down
to small numbers of unmatched observations, we can apply rules
highly specific to our problem.
Getting Started
---------------
Our first step will be an exact match. Let's assume the original datasets
to be matched are called left.dta and right.dta:
. use right.dta
. sort name
. save, replace
. use left.dta
. sort name
. merge name using right
. sort name
. save result /* <- just temporarily */
. keep if _merge==3
. drop _merge
. save matched
. use result
. keep if _merge==1
. keep name
. merge name using left
. keep if _merge==3
. drop _merge
. save resid1
. use result
. keep if _merge==2
. keep name
. mege name using right
. keep if merge==3
. drop _merge
. save resid1
. erase result
We are going to ahve to do something like that each iteration, and that is too
much typing for us every to get right time after time. Moreover, the next
time, we are going to have to add to matched.dta, and that is just going to
complicate the issue. So let's write a program:
program merge3
confirm file matched.dta
confirm file resid1.dta
confirm file resid2.dta
use resid1, clear
merge name using resid2
if _N==0 {
exit
}
save result
keep if _merge==3
drop _merge
append using matched
save matched, replace
use result, clear
keep if _merge==1
keep name
merge name using resid1
keep if _merge==3
drop _merge
sort name
save resid1, replace emptyok
use result, clear
keep if _merge==2
keep name
merge name using resid2
keep if _merge==3
drop _merge
sort name
save resid2, replace emptyok
erase result.dta
end
Starting over
-------------
With that, let's start over. First, we need to do the setup to run our
program:
. clear
. save matched, emptyok
. use left
. sort name
. save resid1
. use right
. sort name
. save resid2
Now let's do the exact match:
. merge3
Step 2
------
At this point, we have the three datasets. Look at resid1.dta and resid2.dta.
Spot a problem. Let's pretend we see things like "Stata Corp" and Stata
Corp.", and "ABC, Inc." and "ABC Inc".
Let's get rid of periods and commads in the name:
. use resid1, clear
. replace name = subinstr(name, ".", "", .)
. replace name = subinstr(name, ",", "", .)
. sort name
. save, replace
. use resid2, clear
. replace name = subinstr(name, ".", "", .)
. replace name = subinstr(name, ",", "", .)
. sort name
. save, replace
Now we can perform the second step of the merge:
. merge3
Basically, we keep working like that until we have merged all the
observations we think reasonable.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/