RE: st: RE: RE: 'Fuzzy' text match

From	David Kantor <[email protected]>
To	[email protected]
Subject	RE: st: RE: RE: 'Fuzzy' text match
Date	Sun, 23 Mar 2014 22:37:44 -0400

Hello Robert,

I haven't followed all the details, but here's what I might be tempted to do.

In each of the datasets, create an alternative company variable that uses a set of standardized spellings. You would also want to split the name into parts: "Warren Engine Plant Ford Motor Company" would be divided as...

        company: Ford Motor Company
        branch (or division or facility): Warren Engine Plant
And you would match on company, ignoring branch.
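
A minimal sketch of that split, written in Python on the assumption that this step is done outside Stata. The abbreviation map and the function names `standardize` and `split_name` are made up for illustration, and it assumes you already have the list of company names from the smaller file:

```python
import re

# Hypothetical abbreviation map; extend it to suit your own data.
STANDARD = {"co": "company", "corp": "corporation", "inc": "incorporated"}

def standardize(name):
    """Lowercase, replace punctuation with spaces, expand a few abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(STANDARD.get(w, w) for w in words)

def split_name(facility_name, companies):
    """Return (company, branch) if a known standardized company name appears
    as a run of whole words in facility_name; otherwise return
    (None, standardized facility name)."""
    std = standardize(facility_name)
    padded = f" {std} "
    # Try longer company names first so the most specific one wins.
    for company in sorted(companies, key=len, reverse=True):
        needle = f" {company} "
        if needle in padded:
            branch = " ".join(padded.replace(needle, " ", 1).split())
            return company, branch
    return None, std

companies = {standardize(c) for c in ["Ford Motor Company"]}
print(split_name("Warren Engine Plant Ford Motor Co.", companies))
```

The resulting company column is what you would then merge on back in Stata; the branch column is kept only for reference.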

Of course, how you go about doing the transformation is another matter; it could be involved. It may be better done outside Stata (or maybe not). But if you can automate the transformation, then the size of 1,000,000 obs would not be an issue.
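
One way the automation might look, again as a rough Python sketch rather than anything tested on real data: score each facility name by the share of a company's words that it contains, using a word index so that the 1,000,000 facility names are not each compared against all 1,000 company names. The names `build_index` and `best_match` and the `threshold` parameter are assumptions of mine, not part of any package.

```python
from collections import defaultdict

def tokens(name):
    """Set of lowercase words in a name."""
    return set(name.lower().split())

def build_index(companies):
    """Map each word to the set of company names that contain it."""
    index = defaultdict(set)
    for company in companies:
        for word in tokens(company):
            index[word].add(company)
    return index

def best_match(facility, index, threshold=1.0):
    """Return (company, score), where score is the share of the company's
    words found in the facility name, or (None, 0.0) if no company reaches
    the threshold."""
    hits = defaultdict(int)
    for word in tokens(facility):
        for company in index.get(word, ()):
            hits[company] += 1
    if not hits:
        return None, 0.0

    def rank(company):
        n_words = len(tokens(company))
        # Highest coverage first; ties go to the longer name, then alphabetical.
        return (hits[company] / n_words, n_words, company)

    best = max(hits, key=rank)
    score = hits[best] / len(tokens(best))
    return (best, score) if score >= threshold else (None, 0.0)

index = build_index(["Ford Motor Company", "General Motors Company"])
print(best_match("Warren Engine Plant Ford Motor Company", index))
```

Very common words such as "company" carry little information, so you might drop them before indexing or keep the threshold at 1.0 so that every word of the company name has to be present.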

Maybe you already thought of this, but I wanted to offer my thoughts.
HTH
--David

> From: [email protected] [[email protected]] on behalf of Robert Davidson [[email protected]]

> Sent: Sunday, March 23, 2014 5:15 PM
> To: [email protected]
> Subject: st: 'Fuzzy' text match
>
> Dear Statalist,
>
> I am trying to do a text match across two files in Stata 13 in which
> the names I want to match will not be the same in the two files.  I
> have looked into options here and tried a few, including strgroup, but
> these do not work for the following reason: in one file I have company
> name e.g. Ford Motor Company, and in the other file I have facility
> name e.g. Warren Engine Plant Ford Motor Company.  strgroup does not
> consider these two strings as even remotely close (Levenshtein
> distance is 22 here) and treats words that have nothing in common as
> being much closer.  Is there a way to measure how much of one string
> appears in another so that cases like the above example might be
> considered reasonably close?  To use strgroup with a threshold that
> would include a match like above, I will wind up with about 98% false
> matches.  Also, my two datasets are about 1,000 observations and
> 1,000,000 observations so doing something manually is quite
> cumbersome.
>
> Thank you,
> Robert Davidson


