RE: st: identifying duplicate records

From	Nick Cox <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	RE: st: identifying duplicate records
Date	Fri, 10 Feb 2012 16:49:50 +0000

Let's be clear on this: 

-duplicates- finds exact duplicates only. It will not find "duplicates" that differ in any way, and it is in no way designed to find _approximate_ duplicates.

Nick 
[email protected] 

Dimitriy V. Masterov

I assume you've already tried the -duplicates- command on various
combinations of the id variables and that did not work.

I would create a combined id that concatenates the dob, NHS number and
surname, then run the user-written command -strgroup- on this variable.
This approach will still require a bit of manual work.
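
To make the idea of comparing concatenated keys concrete, here is a
minimal sketch in Python rather than Stata. It is not how -strgroup- is
implemented; it only shows a plain Levenshtein edit distance applied to
a combined key. The function names, the "|" separator, and the
normalization by the longer string are assumptions made for
illustration.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def combined_key(dob, nhs, surname):
    """Hypothetical combined id: dob, NHS number and upper-cased surname."""
    return f"{dob}|{nhs}|{surname.strip().upper()}"


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Pairs of keys whose normalized distance falls below some chosen
threshold would be candidates for manual review. The threshold itself
has to be picked by inspecting the data.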

A non-Stata approach is to use Google Refine.

On Fri, Feb 10, 2012 at 9:08 AM, raoul reulen <[email protected]> wrote:

> Just wondering if I could get some advice. I have a large database
> with around 300,000 records of individuals. There can be more than one
> record per individual. Now, how do I identify individuals? I assume
> that it is the same individual if:
>
> date of birth and NHS number are the same, OR
> date of birth and surname are the same, OR
> surname and NHS number are the same.
>
> So there are various combinations possible. A date of birth could have
> typos in it; but if the NHS number and the surname are the same then I
> assume it is the same person. The NHS number can have typos, but if
> the date of birth and the surname are the same I will assume it is the
> same person.
>
>  What is the best way to approach this?  I want to end up with an
> id-number that identifies the individual.  Many thanks for your help.
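
The rule in the question (two of the three fields agree exactly) can be
sketched as a linking problem: records that agree on any of the three
pairs of fields are joined, and each resulting group gets one id. The
following is a minimal sketch in Python rather than Stata, using a
union-find structure. The function name, the tuple layout (dob, nhs,
surname), and the choice to skip records with an empty field are
assumptions, not anything given in the thread.

```python
def assign_person_ids(records):
    """records: list of (dob, nhs, surname) tuples; returns one id per record."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # The three pairs of fields: dob+nhs, dob+surname, surname+nhs.
    for a, b in ((0, 1), (0, 2), (2, 1)):
        first_seen = {}
        for i, rec in enumerate(records):
            if not rec[a] or not rec[b]:
                continue  # assumption: an empty field never counts as a match
            key = (rec[a], rec[b])
            if key in first_seen:
                union(i, first_seen[key])
            else:
                first_seen[key] = i

    # Number the groups 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(records))]
```

Note that linking is transitive: if record A matches B on date of birth
and NHS number, and B matches C on surname and NHS number, then A and C
end up with the same id even though they may agree on only one field.
Whether that is acceptable depends on the data, so groups built this
way are worth checking by hand.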
