st: fuzzy match two data sets with strgroup

From	Koleman Strumpf <[email protected]>
To	<[email protected]>
Subject	st: fuzzy match two data sets with strgroup
Date	Fri, 29 Nov 2013 11:48:59 -0600

Hi Statalist:

I have two data sets which I would like to match based on a variable (Match_Var). Match_Var is slightly different in the two files due to the treatment of non-standard characters, truncation of the string, and some other small changes. But I want to pair the two files up as best as I can.

I would like to use strgroup for this purpose. The main difficulty I am encountering is that its main output (the group variable) will match strings regardless of the data set they come from. For example, a group may contain two observations from the same data set, while I am only interested in matching observations from different data sets. That is, I do something like this,

     . use unmatched1.dta, clear
     . append using unmatched2.dta
     . strgroup Match_Var if _merge!=3, gen(group) threshold(0.02) force

where _merge=1 in unmatched1.dta and _merge=2 in unmatched2.dta. I will have groups like,

     Match_Var           _merge    group
     somestuff123.txt    1         359
     somestuff124.txt    1         359

Since I am only interested in matching observations from different files, this is not a useful pairing (that is, _merge is the same for both of these observations).
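
To make the problem concrete, here is a minimal sketch of how such same-file groups could be flagged after strgroup has run. It assumes the appended data already hold Match_Var, _merge, and group; the toy values and the variable name cross_file are made up for illustration and are not from my real data.

```stata
* Toy data standing in for the appended files after strgroup has run.
* All values here are illustrative.
clear
input str20 Match_Var byte _merge int group
"somestuff123.txt" 1 359
"somestuff124.txt" 1 359
"somestuff200.txt" 1 360
"somestuff200.tx"  2 360
end

* Within each group, sort by _merge. The group spans both files
* only if its first and last _merge values differ.
bysort group (_merge): gen byte cross_file = (_merge[1] != _merge[_N])

list, sepby(group)
```

Groups with cross_file == 0 contain observations from only one file and could be dropped or have group set to missing; this does not stop strgroup from forming them in the first place.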


I have not been able to avoid these cases. Some things I have tried:
- decrease the threshold --> but then I end up missing most of the pairs I want

- increase the threshold --> works better, and the groups include a mix of observations from the two files; but then I am not sure I have a one-for-one pairing between the two files

- sort based on Match_Var after running strgroup (sort group Match_Var _merge) --> this does not work in my case, since it will typically group all of the observations from one data file first and then the ones from the other file
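
On the one-for-one concern with a larger threshold, a rough check is to count how many observations each group draws from each file. This is only a sketch on toy data; n_file1, n_file2, and one_to_one are hypothetical names, and the strings are invented.

```stata
* Toy data: group 1 is a clean pair, group 2 has two candidates from file 1.
clear
input str20 Match_Var byte _merge int group
"alpha_report.txt"  1 1
"alpha-report.txt"  2 1
"beta_report.txt"   1 2
"beta_report1.txt"  1 2
"beta-report.txt"   2 2
end

* Count the observations each group takes from each file.
bysort group: egen n_file1 = total(_merge == 1)
bysort group: egen n_file2 = total(_merge == 2)

* A one-for-one pairing has exactly one observation from each file.
gen byte one_to_one = (n_file1 == 1 & n_file2 == 1)

tabulate one_to_one
```

Groups with one_to_one == 0 would still need to be resolved by hand or by some other rule, so this only tells me where the pairing is ambiguous.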

Are there any suggestions on dealing with this? I would imagine this is a pretty standard use of fuzzy matching, so perhaps someone has encountered this problem and has a solution.

PS I know there is also the reclink command, which seems to do what I want. However, as has been documented elsewhere on Statalist, this command is fragile and often crashes before giving any results, e.g.

     . reclink Match_Var using file1.dta, gen(myscore) idm(id_1) idu(id_2)
     0 perfect matches found

     Going through 54513 observation to assess fuzzy matches, each .=5% complete

     option KING not allowed
     r(198);

I have tried removing as many characters as I could to avoid this, but I can never get the command to make any progress, so I have given up on it.


