Re: st: Re: Finding "near"-matches

From	Seb Buechte <[email protected]>
To	[email protected]
Subject	Re: st: Re: Finding "near"-matches
Date	Fri, 28 Oct 2005 15:13:53 +0200

Clyde and Michael,

I also programmed something to measure how similar two strings are,
using the edit-distance method. The edit distance between two strings
is the minimum number of single-character changes (insertions,
deletions, or substitutions) required to turn one string into the
other. I admit that what I programmed is somewhat "quick & dirty" code.
If you would like, I can email it to you, but if you would like to know
how it works, you could check out this website:

http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

There you will find a description of the underlying algorithm.
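
For readers who just want the idea, here is a minimal sketch of the
dynamic-programming edit distance in Python. It is not the code offered
above; the function name edit_distance and the unit cost for every
change are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b.

    Illustrative sketch only: every change is assumed to cost 1.
    """
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b; start with the empty prefix of a.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Turning the first i characters of a into "" takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("Smith", "Smyth"))
```

Only one row of the table is kept at a time, so memory grows with the
length of the second string rather than with the product of both
lengths. A small distance relative to the string lengths suggests a
likely near-match.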

Kind regards,
sebastian

On 10/27/05, Michael Blasnik <[email protected]> wrote:
> "Clyde Schechter" <[email protected]> wrote about trying to match not
> quite identical text strings between datasets.  I also spend a great deal of
> time trying to match across administrative databases and have developed a
> few tools to help.  There is a fair amount of literature on string
> comparators (e.g., US Census web site) that produce some rating of the
> similarity of two text strings.  I have coded up a couple of them and tend
> to use the bigram (which counts the proportion of 2-character substrings
> that exist in both strings).  I have also automated some of the common-typo
> problems (e.g., l vs. 1, 0 vs O) for specific projects where I simply create
> a new version of each of the strings that replaces all occurrences of l and
> O with 1 and 0 (and other common errors) before running the string
> comparison.
>
> If there is interest, I can email the bigram ado file or potentially post it
> on SSC when I get around to writing up the help.
>
> Michael Blasnik
> [email protected] .
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
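
The bigram comparison and the typo clean-up described in the quoted
message could be sketched in Python roughly as follows. This is not the
ado file mentioned there. The exact formula it uses is not given in the
thread, so this sketch assumes a Dice-style score (twice the number of
shared bigrams divided by the total number of bigrams in both strings).
The names TYPO_MAP, normalize, bigrams, and bigram_similarity are
hypothetical.

```python
from collections import Counter

# Hypothetical clean-up table covering only the two confusions named in
# the thread: lowercase l typed for 1, and capital O typed for 0.
TYPO_MAP = str.maketrans({"l": "1", "O": "0"})


def normalize(s: str) -> str:
    """Return a comparison key with the common typos replaced."""
    return s.translate(TYPO_MAP)


def bigrams(s: str) -> Counter:
    """Count every 2-character substring of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Assumed Dice-style score in [0, 1]; 1.0 means identical bigrams."""
    ga, gb = bigrams(a), bigrams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Both strings are shorter than 2 characters, so there are no
        # bigrams to compare; fall back to exact equality.
        return 1.0 if a == b else 0.0
    # Counter & Counter keeps the minimum count of each shared bigram.
    shared = sum((ga & gb).values())
    return 2 * shared / total


if __name__ == "__main__":
    # Compare the cleaned-up keys rather than the raw strings.
    print(bigram_similarity(normalize("A10l"), normalize("AlO1")))
```

The clean-up is applied to both strings only to build a comparison key;
the original values stay untouched, since replacing every l with 1 would
damage ordinary names.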

Follow-Ups:
- Re: st: Re: Finding "near"-matches
  - From: Aaron <[email protected]>

References:
- st: Finding "near"-matches
  - From: Clyde Schechter <[email protected]>
- st: Re: Finding "near"-matches
  - From: "Michael Blasnik" <[email protected]>
