
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


st: 'Fuzzy' text match

From	Robert Davidson <[email protected]>
To	[email protected]
Subject	st: 'Fuzzy' text match
Date	Sun, 23 Mar 2014 17:15:15 -0400

Dear Statalist,

I am trying to match text across two files in Stata 13, where the
names I want to match are not identical in the two files.  I have
looked into the available options and tried a few, including strgroup,
but they do not work for the following reason: one file has the
company name (e.g., Ford Motor Company), and the other has the
facility name (e.g., Warren Engine Plant Ford Motor Company).
strgroup does not consider these two strings even remotely close
(the Levenshtein distance is 22 here) and treats names that have
nothing in common as being much closer.  Is there a way to measure how
much of one string appears in another, so that cases like the example
above would be considered reasonably close?  If I set the strgroup
threshold loose enough to include a match like the one above, I would
wind up with about 98% false matches.  Also, my two datasets have
about 1,000 and 1,000,000 observations, so doing anything manually
would be quite cumbersome.

Thank you,
Robert Davidson
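
One way to make "how much of one string appears in another" concrete
is a word-level containment score: the share of the company name's
words that also occur in the facility name.  The sketch below is
written in Python rather than Stata, purely to illustrate the measure;
the function names tokens() and containment() are hypothetical and are
not part of strgroup or any Stata package.  It assumes that matching
on whole words, ignoring case and punctuation, is acceptable.

```python
import re

def tokens(name):
    """Lowercase a name and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def containment(company, facility):
    """Share of the company name's words that also occur in the facility name.

    Hypothetical helper for illustration: 1.0 means every word of the
    company name appears somewhere in the facility name, 0.0 means none do.
    """
    company_words = tokens(company)
    if not company_words:
        return 0.0
    facility_words = set(tokens(facility))
    hits = sum(1 for word in company_words if word in facility_words)
    return hits / len(company_words)

# The example pair from the post: every company word is in the facility name.
score = containment("Ford Motor Company",
                    "Warren Engine Plant Ford Motor Company")

# A made-up facility name that shares no whole word with the company name.
unrelated = containment("Ford Motor Company",
                        "Warren Engine Plant General Motors")
```

Unlike an edit distance, this score is not penalized by the extra
words in the longer string.  With about 1,000 company names and
1,000,000 facility names, scoring every pair means on the order of
10^9 comparisons, so in practice one would probably first restrict
candidates to pairs that share at least one word.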

Follow-Ups:
- st: RE: 'Fuzzy' text match
  - From: Joe Canner <[email protected]>
