I have a dataset, one of whose variables contains names of drugs. Many of
the entries are misspelled or truncated. I have an index file with a
reasonably complete list of commercial and generic drug names. After
merging the files and identifying exact matches, I would like to try to
match the remaining, presumably misspelled, drug names with a corresponding
correct name from the index. With people's names, the soundex algorithm
usually yields a reasonably short list of candidate matches. With these
drug names, however, many of the misspellings each match several dozen
candidates, so the resulting list of names and candidate matches is too
long to review and select from by hand.
Does anybody out there know of an alternative to soundex coding that might
work better in this peculiar vocabulary? Or of another approach to this
problem?
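
The collision problem described above can be sketched in a few lines. This is only an illustration, written in Python rather than Stata. The `soundex` function is a hand-written version of standard American Soundex, not the output of any particular package. The five-name `INDEX` list and the misspelling "metfromin" are made-up stand-ins for the real index file and data.

```python
# Standard American Soundex digit groups; vowels, h, w and y carry no digit.
GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
CODES = {ch: digit for digit, letters in GROUPS.items() for ch in letters}


def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad with zeros, keep four characters


# Hypothetical stand-in for the index file of correct drug names.
INDEX = ["metformin", "metoprolol", "methotrexate", "cefazolin", "cefaclor"]


def candidates(entry, index=INDEX):
    """Return every index name that shares the entry's soundex code."""
    code = soundex(entry)
    return [name for name in index if soundex(name) == code]


print(soundex("metfromin"), candidates("metfromin"))
# M316 ['metformin', 'metoprolol']
```

Even in this toy list the misspelling picks up an unrelated drug, because the code keeps only the first letter and the next three consonant classes. Drug names that share an opening stem therefore tend to share a code, which is presumably why a full index returns dozens of candidates.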
Thanks in advance for any help.
Clyde Schechter
Dept. of Family Medicine & Community Health
Albert Einstein College of Medicine
Bronx, NY, USA