This sounds like a job for something similar to the Soundex algorithm
described by Donald Knuth. You can find out more about this at
http://www.dcs.ed.ac.uk/home/stg/pub/S/soundex.html
or at
http://us2.php.net/soundex
or at
http://west-penwith.org.uk/misc/soundex.htm
I hope this helps.
Roger
At 16:49 28/10/2005, you wrote:
The topic gets more and more interesting. I often need to 'fuzzily'
match names from two databases that differ only slightly. Here are
some examples:
Ford Co.
Ford Corporation
Ford Inc. (just an example)
or
XYZ Tech
XYZ Technology Inc.
Can you recommend some programs to generate a list of 'fuzzy' or
'near' matches for a name (one or more alphanumeric characters)?
Even if a program provides just the three possible matches for
the name 'Ford', that's still better than hand-checking.
Aaron
On 10/28/05, Seb Buechte <[email protected]> wrote:
> Clyde and Michael,
>
> I also programmed something to find out how similar two strings are
> using the edit-distance method. The edit distance between two strings
> is the number of changes required to turn one string into the
> other. I admit that what I programmed is somewhat "quick & dirty"
> code. If you would like, I can email it to you, but if you would
> like to know how it works, you could check out this website:
>
> http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/
>
> There you will find a description of the underlying algorithm.
>
> Kind regards,
> sebastian
>
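As a rough illustration of the edit-distance idea described above, here is a minimal Python sketch. It is not Sebastian's code. It assumes the "changes" are single-character insertions, deletions, and substitutions (the Levenshtein variant), and the function name edit_distance is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed variant).

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b.
    """
    # previous[j] = distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("Ford Co.", "Ford Corp."))
    print(edit_distance("XYZ Tech", "XYZ Technology"))
```

A smaller distance means a closer match, so candidate pairs could be sorted by distance and only the closest ones checked by hand.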
> On 10/27/05, Michael Blasnik <[email protected]> wrote:
> > "Clyde Schechter" <[email protected]> wrote about trying to match not
> > quite identical text strings between datasets. I also spend a great
> > deal of time trying to match across administrative databases and have
> > developed a few tools to help. There is a fair amount of literature on
> > string comparators (e.g., US Census web site) that produce some rating
> > of the similarity of two text strings. I have coded up a couple of
> > them and tend to use the bigram (which counts the proportion of
> > 2-character substrings that exist in both strings). I have also
> > automated the handling of some common-typo problems (e.g., l vs. 1,
> > 0 vs. O) for specific projects, where I simply create a new version of
> > each of the strings that replaces all occurrences of l and O with 1
> > and 0 (and other common errors) before running the string comparison.
> >
> > If there is interest, I can email the bigram ado file or potentially
> > post it on SSC when I get around to writing up the help.
> >
> > Michael Blasnik
> > [email protected] .
> >
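The two steps Michael describes (replacing commonly confused characters, then scoring by shared 2-character substrings) can be sketched in Python as follows. This is not the bigram ado file. The exact scoring formula is an assumption: the sketch uses the Dice coefficient over sets of bigrams, and the names normalize, bigrams, and bigram_similarity are hypothetical.

```python
# Map the confusable characters named in the message: l -> 1 and O -> 0.
CONFUSABLE = str.maketrans({"l": "1", "O": "0"})


def normalize(s: str) -> str:
    """Return a copy of s with l replaced by 1 and O replaced by 0."""
    return s.translate(CONFUSABLE)


def bigrams(s: str) -> set:
    """Return the set of distinct 2-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient on bigram sets (assumed formula), from 0.0 to 1.0."""
    na, nb = normalize(a), normalize(b)
    ga, gb = bigrams(na), bigrams(nb)
    if not ga and not gb:
        # Both strings are too short to have bigrams: compare them directly.
        return 1.0 if na == nb else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


if __name__ == "__main__":
    print(bigram_similarity("XYZ Tech", "XYZ Technology Inc."))
    print(bigram_similarity("A10", "AlO"))
```

Pairs scoring above some project-specific threshold could then be listed for manual review; the threshold itself would need tuning on real data.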
> >
> > *
> > * For searches and help try:
> > * http://www.stata.com/support/faqs/res/findit.html
> > * http://www.stata.com/support/statalist/faq
> > * http://www.ats.ucla.edu/stat/stata/
> >
>
--
Roger Newson
Lecturer in Medical Statistics
Department of Public Health Sciences
Division of Asthma, Allergy and Lung Biology
King's College London
5th Floor, Capital House
42 Weston Street
London SE1 3QD
United Kingdom
Tel: 020 7848 6648 International +44 20 7848 6648
Fax: 020 7848 6620 International +44 20 7848 6620
or 020 7848 6605 International +44 20 7848 6605
Email: [email protected]
Website: http://phs.kcl.ac.uk/rogernewson/
Opinions expressed are those of the author, not the institution.