Re: st: RE: Matching Names
Dear Max,
I agree that 52000 is a lot of cases. I've never had to deal with
that many, but your method depends on your tolerance of bad/missed
matches. In my case, where we had to decide if a person named by
respondent A is the same as a person with a similar name described by
respondent B (where we had recorded gender, race/ethnicity,
approximate age), we found that all mechanical matching algorithms
were pretty bad.
My advice is to do it in two parts. Match as much as you can (say
using -soundex- as Kieran suggests), then eyeball the matches (along
with any auxiliary information you have on them, like age, gender, or
whatever is in your dataset). Then do the same for the non-match
side. It's a lot of work, but you only have to do it once. Of
course, if you don't care about mismatches, go the mechanical route.
I'd still eyeball at least a subset to get a mismatch error rate
estimate.
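
A minimal sketch of this two-part workflow, written in Python purely for
illustration (the same steps could be done in Stata). It uses the
standard-library difflib similarity ratio as a stand-in for -soundex-;
that is a substitution, not a method anyone in this thread has tested.
The function names and the 0.85 cutoff are hypothetical choices.

```python
import difflib
import random


def split_matches(names_a, names_b, cutoff=0.85):
    """Pass 1: pair each name in A with its closest name in B.

    Names with no candidate at or above `cutoff` go to the unmatched
    list, which is then reviewed by hand (pass 2).
    """
    matched, unmatched = [], []
    for name in names_a:
        best = difflib.get_close_matches(name, names_b, n=1, cutoff=cutoff)
        if best:
            matched.append((name, best[0]))
        else:
            unmatched.append(name)
    return matched, unmatched


def review_sample(pairs, k, seed=0):
    """Draw a random subset of mechanical matches to eyeball."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def mismatch_rate(n_wrong, n_checked):
    """Share of hand-checked pairs that turned out to be wrong."""
    return n_wrong / n_checked


# Example names taken from the original question in this thread.
names_a = ["WILLIAM SMITH", "JORGE F. CHOCAN", "P. BROWN",
           "ENRIQUETA GAUDENCIA"]
names_b = ["WILLIAM SMITHSS", "JORGE F CHOCANOS", "PAUL BROWN",
           "ENRIQUETA G"]

matched, unmatched = split_matches(names_a, names_b)
to_check = review_sample(matched, k=2)
```

With these example names, the mechanical pass pairs WILLIAM SMITH with
WILLIAM SMITHSS and JORGE F. CHOCAN with JORGE F CHOCANOS, and leaves
P. BROWN and ENRIQUETA GAUDENCIA for the manual pass. Abbreviated names
are exactly the cases where auxiliary information helps.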
Dave
====================================
David C. Bell
Professor of Sociology
Indiana University Purdue University Indianapolis (IUPUI)
(317) 278-1336
====================================
On Aug 7, 2008, at 6:10 PM, Kieran McCaul wrote:
This is a big problem.
You might want to investigate using soundex to help with matching
the misspelt names but, depending on the version of soundex that you
use, it may not be particularly useful.
Michael Blasnik wrote an egen function to implement a soundex
algorithm a while ago for Stata 7.
http://ideas.repec.org/c/boc/bocode/s420901.html
You could try that.
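
For readers who have not met soundex, here is a minimal sketch of the
classic American Soundex rules, in Python for illustration only. It is
not Michael Blasnik's egen function and may differ from it in edge
cases; the function name is hypothetical.

```python
def soundex(name):
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    # Keep only plain A-Z; accented letters are dropped in this sketch.
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate repeated codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeat after it counts
    return (result + "000")[:4]  # pad with zeros, keep four characters


print(soundex("SMITH"), soundex("SMITHSS"))
```

Applied to surnames from the examples in the original question, this
version gives S530 for SMITH but S532 for SMITHSS, so extra trailing
consonants can still break a match. That is one way soundex may turn out
to be less useful than hoped.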
______________________________________________
Kieran McCaul MPH PhD
WA Centre for Health & Ageing (M573)
University of Western Australia
Level 6, Ainslie House
48 Murray St
Perth 6000
Phone: (08) 9224-2140
Phone: +61-8-9224-2140
email: [email protected]
http://myprofile.cos.com/mccaul
_______________________________________________
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Max Perez Leon
Sent: Friday, 8 August 2008 5:03 AM
To: [email protected]
Subject: st: Matching Names
Hello statalist users,
I am having a big problem trying to merge two datasets with names.
The problem is that there are tons of typos in both datasets.
Examples below:
DATASET 1: --------------------- DATASET 2:
NAMES--------------------------- NAMES
LUIS PÉREZ --------------------- LUIS PÉREZ
WILLIAM SMITH ------------------ WILLIAM SMITHSS
JORGE F. CHOCAN ---------------- JORGE F CHOCANOS
P. BROWN ----------------------- PAUL BROWN
ENRIQUETA GAUDENCIA------------- ENRIQUETA G
I could do it by hand but I have 52568 obs and more to come. I am
trying to establish a method using regular expressions so that I can
merge the datasets correctly.
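
One possible starting point for the regular-expression idea is to
normalize both name columns before merging. The sketch below is in
Python for illustration only; the function name and the specific
cleaning rules (strip accents, drop punctuation, collapse spaces) are
assumptions, and they only remove formatting differences, not typos.

```python
import re
import unicodedata


def normalize_name(raw):
    """Return an upper-case, accent-free, punctuation-free name."""
    # Decompose accented letters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Replace anything that is not A-Z or a space with a space.
    text = re.sub(r"[^A-Z ]+", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_name("JORGE F. CHOCAN"))
```

After this step, JORGE F. CHOCAN becomes JORGE F CHOCAN, which is closer
to JORGE F CHOCANOS but still not identical, so some fuzzy or manual
matching is still needed on top.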
Any help will be very much appreciated,
Thanks for your time,
Max Perez Leon
PUCP-IEP
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/