In managing data for projects, I sometimes get requests from the
investigators for reports relating to the subjects in a specific list
they provide. Typically the list is given as a text or Excel file
containing the subjects' unique study identifier (string variable),
and the first two letters of the subjects' first and last names (as a
redundancy check).
The study identifiers in the list often contain errors, and when I
merge them with the file of valid study ID's usually several will
come up unmatched. When I then merge on the initials in the file of
valid IDs, I almost always can find a "near" match to the ID in the
list. Sometimes, though, there will be dozens of IDs with the
matching initials and I will have to scan a long list visually to
find the near ID. Once I find the near match, I get back to the
requester asking them to verify that that is who they actually
wanted. I've never been wrong yet, though, rarely, there will be two
near-matches that I can't choose between.
But visual scanning for near matches can be slow, and if the list of
mismatched IDs is long, and contains people with common initials, it
can take a lot of time. I have noticed that most of the incorrect
IDs fall into a small number of error categories: confusing letter
"O" and number "0", confusing letter "l" and number "1", inverting
the order of some digits, omitting one or more leading zeroes, or
omitting one or more digits from a repeated sequence (e.g. listing
32257A1 when it should be 322257A1).
I've been thinking of trying to write a program to look for near
matches based on this taxonomy of errors. But each of these errors
(except dropping leading zero) can occur at any position in the ID
string, and typically these strings run 10 to 16 characters. So this
leaves a lot of possibilities to check and coding all of that seems
ponderous. So I'm wondering if anybody knows of an efficient way to
do this, or even an existing program that does this or something
close to it. (Getting near matches for names is pretty easy using
Soundex, but that won't work for arbitrary alphanumeric ID strings.)
(I know that designing ID numbers with built-in redundancy like an
error-correcting code would be a better solution, but typically I
don't have control over that: the IDs are usually medical record
numbers from various institutions and I have to use them as is.)
Thanks for any suggestions.
Clyde Schechter
Dept. of Family and Social Medicine
Albert Einstein College of Medicine
Bronx, NY, USA