Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: RE: RE: 'Fuzzy' text match
From: David Kantor <[email protected]>
To: [email protected]
Subject: RE: st: RE: RE: 'Fuzzy' text match
Date: Sun, 23 Mar 2014 22:37:44 -0400
Hello Robert,
I haven't followed all the details, but here's what I might be tempted to do.
In each of the datasets, create an alternative company variable that
uses a set of standardized spellings.
You would also want to split the name into parts: "Warren Engine
Plant Ford Motor Company" would be divided as...
company: Ford Motor Company
branch (or division or facility): Warren Engine Plant
And you would match on company, ignoring branch.
Of course, how you go about doing the transformation is another
matter; it could be involved, and it may be better done outside Stata
(or maybe not).
But if you can automate the transformation, then the size of the larger
dataset (1,000,000 observations) would not be an issue.
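For what it's worth, here is a rough sketch of how the transformation might be automated outside Stata, in Python. It is only an illustration under some assumptions: the abbreviation table and the helper names (STANDARD_FORMS, standardize, split_facility) are made up for this example, and it assumes the company name comes at the end of the facility name, as in the Ford example. Real data would need a much longer spelling table and probably other name layouts.

```python
import re

# Hypothetical spelling variants mapped to one standard form (an assumption;
# a real table would be built from the actual data).
STANDARD_FORMS = {
    "co": "company",
    "corp": "corporation",
    "inc": "incorporated",
}


def standardize(name):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(STANDARD_FORMS.get(w, w) for w in words)


def split_facility(facility_name, companies):
    """Split a facility name into (company, branch).

    `companies` is a set of standardized company names. The longest
    trailing run of words that is a known company is taken as the company;
    whatever precedes it is the branch. Returns (None, standardized name)
    when no trailing run of words is a known company.
    """
    words = standardize(facility_name).split()
    for i in range(len(words)):
        candidate = " ".join(words[i:])
        if candidate in companies:
            return candidate, " ".join(words[:i])
    return None, " ".join(words)


# Standardized names from the small (about 1,000 observation) file.
companies = {standardize(c) for c in ["Ford Motor Company", "General Motors Co."]}

company, branch = split_facility("Warren Engine Plant Ford Motor Co", companies)
print(company)  # ford motor company
print(branch)   # warren engine plant
```

Because each facility name costs only one set lookup per word, the work grows with the number of facility records rather than with the number of facility-company pairs. The resulting company column could then be written back out and used as an ordinary merge key.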
Maybe you already thought of this, but I wanted to offer my thoughts.
HTH
--David
From: [email protected]
[[email protected]] on behalf of Robert Davidson
[[email protected]]
> Sent: Sunday, March 23, 2014 5:15 PM
> To: [email protected]
> Subject: st: 'Fuzzy' text match
>
> Dear Statalist,
>
> I am trying to do a text match across two files in Stata 13 in which
> the names I want to match will not be the same in the two files. I
> have looked into options here and tried a few, including strgroup, but
> these do not work for the following reason: in one file I have company
> name e.g. Ford Motor Company, and in the other file I have facility
> name e.g. Warren Engine Plant Ford Motor Company. strgroup does not
> consider these two strings as even remotely close (Levenshtein
> distance is 22 here) and treats words that have nothing in common as
> being much closer. Is there a way to measure how much of one string
> appears in another so that cases like the above example might be
> considered reasonably close? To use strgroup with a threshold that
> would include a match like above, I will wind up with about 98% false
> matches. Also, my two datasets are about 1,000 observations and
> 1,000,000 observations so doing something manually is quite
> cumbersome.
>
> Thank you,
> Robert Davidson
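
On the question in the quoted message of measuring how much of one string appears in another, one simple possibility is a word-level containment score: the share of the shorter name's words that also occur in the longer name. The Python sketch below is an illustration only; the function name containment is made up here, it is not what strgroup computes, and splitting on whitespace is a simplifying assumption.

```python
def containment(short_name, long_name):
    """Fraction of the distinct words in short_name that also occur in long_name."""
    short_words = set(short_name.lower().split())
    long_words = set(long_name.lower().split())
    if not short_words:
        return 0.0
    return len(short_words & long_words) / len(short_words)


# Every word of the company name appears in the facility name.
print(containment("Ford Motor Company", "Warren Engine Plant Ford Motor Company"))  # 1.0

# Only "company" is shared, so the score is one third.
print(containment("Ford Motor Company", "General Motors Company"))
```

Unlike an edit distance, this score does not penalize the extra words in the facility name. Very common words such as "company" will still inflate scores between unrelated names, so dropping or down-weighting them would likely be needed before choosing a threshold.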