[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: identifying last name/first name inversions in administrative datasets

From	Pierre Azoulay <[email protected]>
To	[email protected]
Subject	st: identifying last name/first name inversions in administrative datasets
Date	Wed, 12 Oct 2005 08:53:55 -0400

Dear Stata Listers,

I have a data management problem. I am dealing with an administrative
dataset (patent data). I am trying to assign unique identifiers to
individual patenters. The particular problem that I am trying to
tackle is that of first name/last name inversion. Consider the
following records:

id	   patent_id	   fname       lname
686	  6576756	Knuth	    Alexander
686	  6525177	Knuth	    Alexander
686	  6252052	Knuth	    Alexander
686	  6297364	Knuth	    Alexander
802408	6339140	     Alexander	Knuth
802408	6830924	     Alexander	Knuth
802408	6723832	     Alexander	Knuth
802408	6774226	     Alexander	Knuth
802408	6774226	     Alexander	Knuth
802408	6723832	     Alexander	Knuth
802408	6830924	     Alexander	Knuth
802408	6339140	    Alexander	Knuth

Now it's obvious that Knuth Alexander (id=802408) and Alexander Knuth
(id=686) are one and the same person. The question is how to identify
potential inversions in the data (we have 60,000 distinct identifiers
in the data).

My current thought was to write:

egen fl=group(fname lname)
egen x=nvals(id), by(fl)
keep if x>1

This does not work because egen fl=group(fname lname) and egen
fl=group(lname fname) return different sets of identifiers. I would
want to use an equivalent of egengroup() that is indifferent to the
order of the arguments. Then it would work, and I could flag all the
potentially problematic records.

Could someone help me out?

Thanks

Pierre
----------------------------------------------------------------------------------
Pierre Azoulay
Associate Professor of Management
Columbia University                           Phone: (212) 854-9684
Graduate School of Business             Fax:   Don't send any
3022 Broadway, Uris Hall 704
New York, NY 10023                        http://www.columbia.edu/~pa2009/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: identifying last name/first name inversions inadministrative datasets
  - From: "Jamie Griffin" <[email protected]>
- Re: st: identifying last name/first name inversions inadministrative datasets
  - From: Rafal Raciborski <[email protected]>

Prev by Date: st: Re: question on looping
Next by Date: Re: st: Re: question on looping
Previous by thread: st: Re: question on looping
Next by thread: Re: st: identifying last name/first name inversions inadministrative datasets
Index(es):
- Date
- Thread