st: Re: Finding "near"-matches

From	"Michael Blasnik" <[email protected]>
To	<[email protected]>
Subject	st: Re: Finding "near"-matches
Date	Thu, 27 Oct 2005 11:15:42 -0400

"Clyde Schechter" <[email protected]> wrote about trying to match not-quite-identical text strings between datasets. I also spend a great deal of time matching records across administrative databases and have developed a few tools to help. There is a fair amount of literature on string comparators (e.g., on the US Census web site), which produce a rating of the similarity of two text strings. I have coded up a couple of them and tend to use the bigram comparator, which counts the proportion of two-character substrings that exist in both strings. I have also automated the handling of some common typos (e.g., l vs. 1, 0 vs. O) for specific projects: I simply create a new version of each string that replaces all occurrences of l and O with 1 and 0 (and likewise for other common errors) before running the string comparison.
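As a rough illustration of the idea, here is a minimal Python sketch, not the ado file itself. It assumes one common definition of bigram similarity (twice the number of shared bigrams divided by the total number of bigrams in both strings); the actual comparator may be scored differently. The function names and the two-entry substitution table are hypothetical and cover only the l/1 and O/0 cases mentioned above.

```python
from collections import Counter

# Assumed substitution table: only the two confusions named in the post.
CONFUSABLE = str.maketrans({"l": "1", "O": "0"})


def normalize(s):
    """Replace commonly confused characters, then ignore case."""
    return s.translate(CONFUSABLE).lower()


def bigrams(s):
    """Count every two-character substring of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a, b):
    """Share of bigrams common to both strings, from 0.0 to 1.0.

    Assumption: scored as 2 * shared / (bigrams in a + bigrams in b).
    Strings too short to contain any bigram score 0.0.
    """
    ba = bigrams(normalize(a))
    bb = bigrams(normalize(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    print(bigram_similarity("night", "nacht"))
    print(bigram_similarity("A10", "AlO"))
```

Because the substitutions are applied to both strings before comparison, "A10" and "AlO" are treated as identical, while unrelated strings that merely share a few letter pairs get a low score. In practice one would pick a cutoff score and review the pairs near it by hand.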

If there is interest, I can email the bigram ado file or potentially post it on SSC when I get around to writing up the help.

Michael Blasnik
[email protected] .

Follow-Ups:
- Re: st: Re: Finding "near"-matches
  - From: Seb Buechte <[email protected]>

References:
- st: Finding "near"-matches
  - From: Clyde Schechter <[email protected]>
