Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: identifying duplicate records
From
"William Gould, StataCorp LP" <[email protected]>
To
[email protected]
Subject
Re: st: identifying duplicate records
Date
Mon, 13 Feb 2012 10:21:08 -0600
Raoul Reulen <[email protected]> wrote,
> I have a large database with around 300,000 records of individuals.
> There can be more than one record per individual. Now, how do I
> identify individuals? I assume that it is the same individual if:
>
> Date of birth and NHS number are the same OR
> date of birth and surname are the same OR
> surname and NHS number are the same.
>
> [...]
>
> What is the best way to approach this? I want to end up with an
> id-number that identifies the individual.
This question has already generated good replies. I'm late to the
conversation, but I still want to contribute my thoughts.
I'm doing that because this question is more generic than many of
you may realize. Have you ever had to merge two datasets with
less-than-perfect identification data? It's the same problem.
I have two points I wish to make:
1. How to solve the problem from first principles (i.e., basic
Stata commands), and
2. How better to process data with matches that are not certain.
The reason to solve the problem from first principles is that it forces
us to think the problem through more clearly. In this case, we will
hit upon a problem and discover that part 2 is not optional.
But I get ahead of myself.
1. First principles
--------------------
Assume the variables are
dob date of birth
nhsid NHS number
surname surname
We will create a new variable, id. Raoul offered three rules:
1. Date of birth and NHS number are the same
2. date of birth and surname are the same
3. surname and NHS number are the same.
So I'm going to create three new identification variables --
id1, id2, and id3 -- based on the three rules.
The basic solution is,
. gen id1 = . // rule 1
. sort dob nhsid
. by dob nhsid: replace id1 = 1 if _n==1
. replace id1 = sum(id1)
. gen id2 = . // rule 2
. sort dob surname
. by dob surname: replace id2 = 1 if _n==1
. replace id2 = sum(id2)
. gen id3 = . // rule 3
. sort surname nhsid
. by surname nhsid: replace id3 = 1 if _n==1
. replace id3 = sum(id3)
I now have three identification variables, and I realize that they
may be inconsistent. Consider the following records:
nhsid dob surname
1939488 15jan1953 smith
1939488 15jan1953 jones
2234278 15jan1953 smith
and so on
Call those three records E1, E2, and E3. By rule 1, E1 and E2 match
(same dob and nhsid). By rule 2, E1 and E3 match (same dob and surname).
We have a problem.
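To see the conflict mechanically, one could check whether id2 is
constant within each id1 group (a sketch, assuming id1, id2, and id3
were created as above; the same check applies to the other pairs):

        . bysort id1 (id2): gen byte conflict = id2[1] != id2[_N]
        . count if conflict

A nonzero count means at least one id1 group spans two or more id2
groups, which is exactly the inconsistency shown in the example.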
A better solution
-----------------
We need to write the matching rules more carefully and we need to put
them into an order from most to least certain. In this case, the
rules I suggest are,
R1. nhsid, dob, and surname all match
R2. only nhsid and dob match
R3. only nhsid and surname match
R4. only surname and dob match
R5. only nhsid matches
R6. if none of the above, we will say they are different persons.
The particular rules do not matter -- that's for Raoul to decide -- as
is the order. Bear with me, because it will eventually become easy to
change the rules and their order.
We are going to create two new variables in our data:
reason the reason the observations were matched (1, 2, ..., 6)
id id number within reason
and from that we will then generate another new variable
finalid overall id number
Assume the original data is called master.dta. We proceed like this:
. use master // use original
// create id var, rule 1
. gen id = .
. sort nhsid dob surname
. by nhsid dob surname: replace id = 1 if _n==1
. replace id = sum(id)
// split data in two
. sort id
. by id: gen long grpsize = _N // -keep- cannot be combined with -by-
. preserve
. keep if grpsize > 1
. drop grpsize
. gen reason = 1
. save reason1 // ... those who matched
. restore
. keep if grpsize == 1
. drop grpsize
. save therest, replace // ... and the rest
Having finished with the rule-1 group, we can continue with therest.dta
and apply rule 2:
. use therest // use the rest
// create id var, rule 2
. gen id = .
. sort nhsid dob
. by nhsid dob: replace id = 1 if _n==1
. replace id = sum(id)
// split data in two
. sort id
. by id: gen long grpsize = _N
. preserve
. keep if grpsize > 1
. drop grpsize
. gen reason = 2
. save reason2 // ... those who matched
. restore
. keep if grpsize == 1
. drop grpsize
. save therest, replace // ... and the rest
And so we will continue. This is too much typing for me, so I'm going
to make a program of it to save myself typing and reduce the chances of
error.
------------------------------------------------ myfile.do -----
clear all
* in the program below, I added option -emptyok- to -save-
* in case any of the reasons result in 0 matches.
program doit
args reason_no vars
use therest // use the rest
// create id var
capture drop id // id already exists in therest.dta after the first call
gen id = .
sort `vars'
by `vars': replace id = 1 if _n==1
replace id = sum(id)
// split data in two
sort id
by id: gen long grpsize = _N // -keep- cannot be combined with -by-
preserve
keep if grpsize > 1
drop grpsize
gen reason = `reason_no'
// ... those who matched
save reason`reason_no', replace emptyok
restore
keep if grpsize == 1
drop grpsize
save therest, replace emptyok // ... and the rest
end
use master, clear
save therest, replace
doit 1 "nhsid dob surname"
doit 2 "nhsid dob"
doit 3 "nhsid surname"
doit 4 "surname dob"
doit 5 "nhsid"
* Now handle the leftovers, who are unmatched:
replace id = _n // id already exists from the last -doit- call
gen reason = 6
save reason6, replace emptyok
------------------------------------------------ myfile.do -----
I can now create the first-round results by typing
. do myfile
The result is that I will have new datasets reason1.dta, ..., reason6.dta.
If my do-file has an error, I can fix my do-file and run it again.
I will have updated datasets reason1.dta, ..., reason6.dta.
I would look for poor matches based on data errors in each of the
datasets -- especially reason2.dta through reason5.dta. Based on that,
I might change the variables my rules are based on, change the order
of the rules, or even repair the original variables nhsid, dob, and
surname in the original master.dta!
Aside: I would never actually change master.dta. I would add
lines to my do-file to -replace- values with better values
before saving therest.dta for the first time.
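Such repair lines might look like this (hypothetical examples only --
the real fixes depend on the data errors actually found):

        replace surname = lower(trim(surname)) // normalize case, spacing
        replace nhsid = . if nhsid == 9999999 // hypothetical filler code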
I can change my do-file, type -do myfile-, and I have my new result.
Eventually, I would put the data together thusly,
. do myfile
. use reason1, clear
. append using reason2 reason3 reason4 reason5 reason6
. gen finalid = .
. sort reason id
. by reason id: replace finalid = 1 if _n==1
. replace finalid = sum(finalid)
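A few sanity checks on the combined result would be cheap insurance
(a sketch):

        . assert finalid < . // every observation received an id
        . bysort finalid (reason): assert reason[1] == reason[_N]
        . tab reason // how many matches each rule produced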
Closing comments
----------------
Anyone who deals with problematic matches needs to learn this general
approach.
1. The general approach above is applied to an "append-ish"
problem, but it works with merge problems as well; only the
programming details differ.
2. The general approach is
2.1 Match observations according to the rule that you are
certain is correct. Get those matches out of the way by
saving them in a separate dataset. Your problem is now to
deal with the rest.
2.2 Match the remaining observations according to the rule you
are next most certain about. Get those observations out of
the way by saving them in a separate dataset. Your problem
is now to deal with the rest.
2.3 Repeat step 2.2 as many times as you feel comfortable
repeating it.
2.4 Look at the results each step of the way. I matched on
nhsid and dob second because I figured mistyping a date of
birth was *MORE* likely than mistyping a surname. Therefore,
matches based on nhsid and dob are more likely to be correct
than are matches on nhsid and surname.
2.5 Write your code so that it is easy to change the order in
which the rules are applied because sometimes, after looking
at the results, you will find your initial suspicions were
wrong.
2.6 As you look at the data, consider adding more rules, or
eliminating rules that produce more noise than you expected.
In this case, if Raoul has a sex variable, I suggest he add it
to the list of variables on which he is matching. In that
case, rule 1 would be match on nhsid, dob, surname, and sex,
and more rules would be added, some including sex, some
excluding it.
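In my code, that amounts to editing the -doit- calls, perhaps
like this (one possible set of rules and one possible ordering --
Raoul would pick his own):

        doit 1 "nhsid dob surname sex"
        doit 2 "nhsid dob sex"
        doit 3 "nhsid surname sex"
        doit 4 "surname dob sex"
        doit 5 "nhsid dob"
        doit 6 "nhsid"

with the leftover group's reason number bumped to 7.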
2.7 Note that in my code, adding a rule just means adding one
line to my do-file. Reordering the rules just means
changing the order of the -doit- lines, and possibly the
reason number.
2.8 Also note that I can run my code from the beginning. I play
interactively, but when I'm producing final datasets, I know
exactly how I produced them. The proof is that I can
reproduce them.
2.9 Note that I know why I matched observations.
I hope this helps.
Don't be surprised if I turn this into a http://blog.stata.com entry.
If I do, I'll edit to write more clearly. Right now, I wanted to get
this into Raoul's hands.
-- Bill
[email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/