Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: fuzzy match two data sets with strgroup
From 
 
Koleman Strumpf <[email protected]> 
To 
 
<[email protected]> 
Subject 
 
st: fuzzy match two data sets with strgroup 
Date 
 
Fri, 29 Nov 2013 11:48:59 -0600 
Hi Statalist:
I have two data sets which I would like to match based on a variable 
(Match_Var). The Match_Var is slightliy different in the two files due 
to treatment of non-standard characters, truncations of the string, and 
some other small changes. But I want to pair the two files up as best as 
I can.
I would like to use strgroup for this purpose. The main difficulty I am 
encountering is that the main output of this (the group variable) will 
match strings regardless of the dataset. For example, the group variable 
may contain two observations from the same data set while I am only 
interested in matching observations from different data sets.  That is I 
do something like this,
     . use unmatched1.dta, clear
     . append using unmatched2.dta
     . strgroup Match_Var if _merge!=3, gen(group) threshold(0.02) force
where _merge=1 in unmatched1.dta and _merge=2 in unmatched2.dta. I will 
have groups like,
     Match_Var                    _merge        group
     somestuff123.txt        1                    359
     somestuff124.txt        1                    359
Since I only am interested in matching observations from different files 
this is not a useful pairing (that is, _merge is the same for both of 
these observations).
I have not been able to avoid these cases. Some things I have tried:
- decrease threshold --> but then I end up missing most of the pairs I want
- increase thresholds --> work better, and include a mix of observations 
from the two observations; but then I am not sure I have a one-for-one 
pairing between the two files
- sort based on Match_Var after running strgroup (sort group Match_Var 
_merge) --> this does not work in my case since it will typically group 
all of the observations from one data file first and then the ones from 
the other file)
Are there any suggestions on dealing with this? I would imagine this is 
a pretty standard use of the fuzzy match so perhaps someone has 
encountered this problem and has a solution.
K
PS I know there is also the reclink command which seems to do what I 
want. However as has been documented elsewhere in the statalist this 
command is fragile and often crashes before giving any results, e.g.
     . reclink Match_Var using file1.dta, gen(myscore) idm(id_1) idu(id_2)
     0 perfect matches found
     Going through 54513 observation to assess fuzzy matches, each .=5% 
complete
     option KING not allowed
     r(198);
I have tried removing as many characters as I could to avoid this but I 
can never get the command to make any progress so I have given up on it.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/