
Re: st: Re: Finding "near"-matches

From	Roger Newson <[email protected]>
To	[email protected]
Subject	Re: st: Re: Finding "near"-matches
Date	Fri, 28 Oct 2005 17:52:46 +0100

This sounds like a job for something similar to the Soundex algorithm, which Donald Knuth describes in The Art of Computer Programming. You can find out more about this at

http://www.dcs.ed.ac.uk/home/stg/pub/S/soundex.html

or at

http://us2.php.net/soundex

or at

http://west-penwith.org.uk/misc/soundex.htm
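For readers who want to see the idea in code, here is a minimal Python sketch of the common "American Soundex" rules (keep the first letter, code the remaining consonants as digits, collapse repeated codes, pad to four characters). The function name soundex and the table CODES are illustrative choices, not taken from any of the pages above, and published variants of the algorithm differ in small details.

```python
# Digit codes for consonants; vowels, Y, H and W carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex-style code (assumed American variant)."""
    # Keep only the letters A-Z; spaces, digits and punctuation are ignored.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ford Co.", "Ford Corporation"):
        print(name, soundex(name))
```

Because the code is truncated to four characters, a whole-string Soundex is dominated by the start of the name, so for multi-word company names it may be more useful to code each word separately.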

I hope this helps.

Roger

At 16:49 28/10/2005, you wrote:

The topic gets more and more interesting. I often need to match
'fuzzily' the names from two databases that have very minor
differences. Here are some examples:

Ford Co.
Ford Corporation
Ford Inc. (just an example)

or

XYZ Tech
XYZ Technology Inc.

Can you recommend some programs to generate a list of 'fuzzy' or
'near' matches for a name (one or more alphanumeric
characters)? Even if a program provides the three possible matches for
the name 'Ford', that's still better than hand-checking.

Aaron

On 10/28/05, Seb Buechte <[email protected]> wrote:
> Clyde and Michael,
>
> I also programmed something to find out how similar two strings are
> using the edit-distance method. The edit distance between two strings
> is the number of changes required to transform one string so that
> it equals the other. I admit that what I programmed is somewhat
> "quick & dirty" code. If you would like, I can email it to you, but if
> you would like to know how it works you could check out this website:
>
> http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/
>
> There you will find a description of the underlying algorithm.
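
As an illustration of the edit-distance idea described above, here is a minimal Python sketch of the standard dynamic-programming (Levenshtein) formulation, in which each insertion, deletion, or substitution of one character counts as one change. This is not Sebastian's code, which is not shown in the thread; the function name edit_distance and the unit costs are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b (unit cost for each; an assumed cost scheme)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("XYZ Tech", "XYZ Technology Inc."))
```

Raw distances grow with string length, so when comparing names of very different lengths it may help to divide by the length of the longer string before applying a cutoff.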
>
> Kind regards,
> sebastian
>
> On 10/27/05, Michael Blasnik <[email protected]> wrote:
> > "Clyde Schechter" <[email protected]> wrote about trying to match not
> > quite identical text strings between datasets. I also spend a great deal of
> > time trying to match across administrative databases and have developed a
> > few tools to help. There is a fair amount of literature on string
> > comparators (e.g., US Census web site) that produce some rating of the
> > similarity of two text strings. I have coded up a couple of them and tend
> > to use the bigram (which counts the proportion of two-character substrings
> > that exist in both strings). I have also automated some of the common-typo
> > problems (e.g., l vs. 1, 0 vs. O) for specific projects where I simply create
> > a new version of each of the strings that replaces all occurrences of l and
> > O with 1 and 0 (and other common errors) before running the string
> > comparison.
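
The two steps described above (normalizing common typos, then scoring bigram overlap) can be sketched in Python as follows. This is not the bigram ado file mentioned in the thread, whose exact formula is not given; the names TYPO_MAP, normalize, bigrams and bigram_similarity are hypothetical, and the score shown (twice the number of shared distinct bigrams divided by the total number of distinct bigrams in both strings) is only one plausible way to define "proportion".

```python
# Assumed typo table: lowercase l -> 1 and capital O -> 0, as in the
# examples above. Project-specific substitutions would be added here.
TYPO_MAP = str.maketrans({"l": "1", "O": "0"})


def normalize(s: str) -> str:
    """Return a copy of s with the look-alike characters replaced."""
    return s.translate(TYPO_MAP)


def bigrams(s: str) -> set:
    """Set of distinct two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Share of bigrams common to both strings, scaled to 0..1
    (a Dice-style coefficient; an assumed definition)."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Strings shorter than two characters have no bigrams at all.
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


if __name__ == "__main__":
    x, y = "A10l", "A1O1"
    print(bigram_similarity(x, y))
    print(bigram_similarity(normalize(x), normalize(y)))
```

Replacing every l and O is only sensible for fields such as ID codes where letters and digits are easily confused; applied to ordinary names it would distort the strings, which is presumably why the substitution is described as project-specific.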
> >
> > If there is interest, I can email the bigram ado file or potentially post it
> > on SSC when I get around to writing up the help.
> >
> > Michael Blasnik
> > [email protected] .
> >
> >
> > *
> > * For searches and help try:
> > * http://www.stata.com/support/faqs/res/findit.html
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/
> >


--
Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
Division of Asthma, Allergy and Lung Biology
King's College London

5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom

Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
  or 020 7848 6605 International +44 20 7848 6605
Email: [email protected]
Website: http://phs.kcl.ac.uk/rogernewson/

Opinions expressed are those of the author, not the institution.

Follow-Ups:
- Re: st: Re: Finding "near"-matches
  - From: "Michael Blasnik" <[email protected]>

References:
- st: Finding "near"-matches
  - From: Clyde Schechter <[email protected]>
- st: Re: Finding "near"-matches
  - From: "Michael Blasnik" <[email protected]>
- Re: st: Re: Finding "near"-matches
  - From: Seb Buechte <[email protected]>
- Re: st: Re: Finding "near"-matches
  - From: Aaron <[email protected]>
