Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: RE: 'Fuzzy' text match
From: Robert Davidson <[email protected]>
To: [email protected]
Subject: Re: st: RE: RE: 'Fuzzy' text match
Date: Tue, 25 Mar 2014 09:04:20 -0400
Joe,
Thank you for the idea and code. I tried this on a reduced sample and
manually inspected the matches; it appears to work better than any
other options I have tried.
Best,
Rob
On Sun, Mar 23, 2014 at 6:59 PM, Joe Canner <[email protected]> wrote:
> Robert,
>
> Here is a brute force method to do what you want to do. It assumes that there is a variable -Company- in both data sets.
>
> ****
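> use bigdata, clear
> * Loop over every observation in the big data set (_N = number of observations)
> forvalues x=1/`=_N' {
>     * Name to search for on this pass
>     local key=Company[`x']
>     * Load only the small-data rows whose name contains the key
>     use if strpos(Company,"`key'") using smalldata
>     gen matching="`key'"
>     if `x' == 1 {
>         save match, replace
>     }
>     else {
>         append using match
>         save match, replace
>     }
>     * Reload the big data set for the next pass
>     use bigdata, clear
> }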
>
> gen matching=Company
> merge 1:m matching using match
> ****
> This assumes that the company name in the big data set is contained within the company name in the small data set. If it is the other way around, just switch the order of the arguments in the -strpos()- function call.
>
> There may be ways to simplify this if there is exactly one match, but I haven't played around much with that. Also note that this doesn't do a very good job of accounting for no matches, so you may need to add some code to deal with that situation.
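One possible way to deal with the no-match case mentioned above is to inspect -_merge- after the final -merge-. The following is only a sketch on toy data: the tempfile stands in for the -match- file built by the loop, and the names -small_name- and -nomatch- are hypothetical. Note also that when a variable such as -Company- exists in both files, -merge- keeps the master's values by default, so the sketch stores the name from the other file under a different variable name.

```stata
* Stand-in for the -match- file produced by the loop (toy data).
* -small_name- is a hypothetical name for the matched name from the other file.
clear
input str40 matching str60 small_name
"Ford Motor Company" "Warren Engine Plant Ford Motor Company"
end
tempfile match
save `match'

* Stand-in for the file whose names were used as search keys.
clear
input str40 Company
"Ford Motor Company"
"Acme Widget Works"
end
generate matching = Company
merge 1:m matching using `match'

* _merge == 1 marks names that found no partner in the other file.
generate byte nomatch = (_merge == 1)
list Company small_name nomatch, noobs
```

Observations with nomatch == 1 can then be listed, dropped, or passed on to a looser matching rule.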
>
> Regards,
> Joe
>
> ________________________________
> From: [email protected] [[email protected]] on behalf of Robert Davidson [[email protected]]
> Sent: Sunday, March 23, 2014 6:19 PM
> To: [email protected]
> Subject: Re: st: RE: RE: 'Fuzzy' text match
>
> Joe,
>
> Thank you for the response. In the vast majority of cases the
> complete name from file 1 will be contained in the name from file 2.
> The exceptions will be cases where one file has 'ABC corporation'
> and the other has 'ABC corp', and I can write code to deal
> with the 7 or 8 common ways those things will arise. In the interim,
> I will look into the options you have suggested.
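A rough sketch of the kind of clean-up described above, assuming the differences are limited to a trailing legal suffix. Only the 'ABC corporation' / 'ABC corp' pair comes from this thread; the 'Incorporated' rule, the third name, and the variable names -name- and -clean- are made up for illustration.

```stata
* Toy data: the first two names are from the thread, the third is invented.
clear
input str40 name
"ABC corporation"
"ABC corp"
"XYZ Incorporated"
end

* Lowercase and trim so that the comparison is not case-sensitive.
generate str40 clean = lower(trim(name))

* Map long-form suffixes at the end of the name to one short form.
replace clean = regexr(clean, " corporation$", " corp")
replace clean = regexr(clean, " incorporated$", " inc")
list, noobs
```

The same cleaning would have to be applied to the name variable in both files before running any -strpos()- comparison.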
>
> On Sun, Mar 23, 2014 at 5:36 PM, Joe Canner <[email protected]> wrote:
>> P.S. If you haven't already, check out -reclink-, -vmatch-, and -nearmrg-, all available from SSC. I don't know how they handle this problem, but they might be worth a look.
>> ________________________________________
>> From: [email protected] [[email protected]] on behalf of Joe Canner [[email protected]]
>> Sent: Sunday, March 23, 2014 5:30 PM
>> To: [email protected]
>> Subject: st: RE: 'Fuzzy' text match
>>
>> Robert,
>>
>> Do all comparisons between the two data sets follow the same pattern, e.g., the name in one file is exactly contained within the name in the other file? If so, you can use the -strpos()- function. This will still be challenging to do as a -merge-, but if you come back with a positive answer to the above question, I (or someone else here) can suggest some code that might work in this situation. It would probably involve using the shorter file as a look-up table for the longer file.
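For reference, a minimal illustration of the containment test with -strpos()-, using the example names from this thread. The variable names -company-, -facility-, -pos-, and -contained- are assumptions, and the second facility name is invented.

```stata
* Toy data with one containing and one non-containing pair.
clear
input str40 company str60 facility
"Ford Motor Company" "Warren Engine Plant Ford Motor Company"
"Ford Motor Company" "Acme Widget Works"
end

* strpos(s1, s2) returns the position in s1 at which s2 is first found,
* and 0 if s2 does not occur in s1.
generate pos = strpos(facility, company)
generate byte contained = pos > 0
list, noobs
```

The comparison is case-sensitive, so names may need to be passed through lower() first.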
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins University School of Medicine
>> ________________________________________
>> From: [email protected] [[email protected]] on behalf of Robert Davidson [[email protected]]
>> Sent: Sunday, March 23, 2014 5:15 PM
>> To: [email protected]
>> Subject: st: 'Fuzzy' text match
>>
>> Dear Statalist,
>>
>> I am trying to do a text match across two files in Stata 13 in which
>> the names I want to match will not be the same in the two files. I
>> have looked into options here and tried a few, including strgroup, but
>> these do not work for the following reason: in one file I have company
>> name e.g. Ford Motor Company, and in the other file I have facility
>> name e.g. Warren Engine Plant Ford Motor Company. strgroup does not
>> consider these two strings as even remotely close (Levenshtein
>> distance is 22 here) and treats words that have nothing in common as
>> being much closer. Is there a way to measure how much of one string
>> appears in another so that cases like the above example might be
>> considered reasonably close? If I use strgroup with a threshold loose
>> enough to include a match like the one above, I wind up with about 98% false
>> matches. Also, my two datasets are about 1,000 observations and
>> 1,000,000 observations so doing something manually is quite
>> cumbersome.
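One simple way to measure "how much of one string appears in another" is the share of words in the shorter name that also occur as whole words in the longer name. This is only a sketch of that idea, not something proposed elsewhere in this thread; the variable names are hypothetical and the second facility name is invented.

```stata
* Toy data: first pair is the example from the post, second is invented.
clear
input str40 company str60 facility
"Ford Motor Company" "Warren Engine Plant Ford Motor Company"
"Ford Motor Company" "Acme Motor Works"
end

generate nwords = wordcount(company)
generate nfound = 0

* Loop up to the largest word count; word() returns "" past the last word.
summarize nwords, meanonly
forvalues i = 1/`r(max)' {
    generate str40 w = word(company, `i')
    * Pad with blanks so that only whole words count as hits.
    replace nfound = nfound + 1 if w != "" & strpos(" " + facility + " ", " " + w + " ") > 0
    drop w
}

* Share of company-name words that appear in the facility name.
generate share = nfound / nwords
list company facility share, noobs
```

Here the first pair gets a share of 1 and the second a share of 1/3, so a cutoff on share could separate likely matches from unlikely ones. The comparison is case-sensitive and treats every word equally, so common words such as "Company" would inflate the score.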
>>
>> Thank you,
>> Robert Davidson
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/