Dear Stata Listers,
I have a data management problem. I am dealing with an administrative
dataset (patent data). I am trying to assign unique identifiers to
individual patenters. The particular problem that I am trying to
tackle is that of first name/last name inversion. Consider the
following records:
id patent_id fname lname
686 6576756 Knuth Alexander
686 6525177 Knuth Alexander
686 6252052 Knuth Alexander
686 6297364 Knuth Alexander
802408 6339140 Alexander Knuth
802408 6830924 Alexander Knuth
802408 6723832 Alexander Knuth
802408 6774226 Alexander Knuth
802408 6774226 Alexander Knuth
802408 6723832 Alexander Knuth
802408 6830924 Alexander Knuth
802408 6339140 Alexander Knuth
Now it's obvious that Knuth Alexander (id=802408) and Alexander Knuth
(id=686) are one and the same person. The question is how to identify
potential inversions in the data (we have 60,000 distinct identifiers
in the data).
My current thought was to write:
egen fl=group(fname lname)
egen x=nvals(id), by(fl)
keep if x>1
This does not work because egen fl=group(fname lname) and egen
fl=group(lname fname) return different sets of identifiers. I would
want to use an equivalent of egengroup() that is indifferent to the
order of the arguments. Then it would work, and I could flag all the
potentially problematic records.
Could someone help me out?
Thanks
Pierre
----------------------------------------------------------------------------------
Pierre Azoulay
Associate Professor of Management
Columbia University Phone: (212) 854-9684
Graduate School of Business Fax: Don't send any
3022 Broadway, Uris Hall 704
New York, NY 10023 http://www.columbia.edu/~pa2009/
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/