"Clyde Schechter" <[email protected]> wrote about trying to match not
quite identical text strings between datasets. I also spend a great deal of
time trying to match across administrative databases and have developed a
few tools to help. There is a fair amount of literature on string
comparators (e.g., on the US Census web site) that produce a rating of the
similarity of two text strings. I have coded up a couple of them and tend
to use the bigram comparator (which counts the proportion of two-character substrings
that exist in both strings). I have also automated some of the common-typo
problems (e.g., l vs. 1, O vs. 0) for specific projects, where I simply create
a new version of each string that replaces all occurrences of l and
O with 1 and 0 (and other common errors) before running the string
comparison.
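
For anyone who wants to experiment before the ado file is available, here is a
rough Python sketch of the two steps described above. It is not the ado file.
The function names and the substitution table are made up for illustration,
and the score uses one common definition of a bigram comparator (twice the
number of shared two-character substrings divided by the total number in both
strings), which may not match my Stata code exactly.

```python
from collections import Counter

# Hypothetical substitution table: only the two confusions mentioned in the
# post (l -> 1, O -> 0). A real project would add its own.
TYPO_MAP = str.maketrans({"l": "1", "O": "0"})


def normalize(s: str) -> str:
    """Return a copy of s with commonly confused characters replaced."""
    return s.translate(TYPO_MAP)


def bigrams(s: str) -> Counter:
    """Count every two-character substring of s (repeats are kept)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Score in [0, 1]: 2 * shared bigrams / total bigrams in both strings.

    This is an assumed definition (Dice coefficient on bigram counts).
    """
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if a == b else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2.0 * shared / total


if __name__ == "__main__":
    raw_a, raw_b = "1O Oak Lane", "10 Oak Lane"
    print(bigram_similarity(raw_a, raw_b))                        # before cleanup
    print(bigram_similarity(normalize(raw_a), normalize(raw_b)))  # after cleanup
```

Because the same substitutions are applied to both strings, it does not matter
that legitimate letters get replaced as well (e.g., "Clyde" becomes "C1yde" on
both sides); the cleaned versions are used only for the comparison.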
If there is interest, I can email the bigram ado file or potentially post it
on SSC when I get around to writing up the help.