Dear all,
I would like to identify (and later delete) duplicates in a dataset.
However, some duplicates cannot be recognized by Stata, because some
variables in my dataset have poor data quality. The search for
duplicates is based on a string variable "name".
Simplified, my dataset looks like this:

Name                 var1  var2
Peter Enterprises       1     2
PeterEnterprises        1     2
Peter!Enterprises       1     2
Geter Enterprises       1     2

"Name" is the only variable which I can use to select duplicates. I know
that there are methods and programs that can compute a kind of
"similarity index", which measures how similar two or more strings are
by counting the characters that differ between them.
In my example, this means that each of the other three cases has a
"similarity index" of 1 relative to the first, because only one character
has to be changed, inserted or deleted to make them equal.
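To make the idea concrete, here is a minimal sketch of one common choice
for such an index, the Levenshtein (edit) distance, which counts
single-character insertions, deletions and substitutions. It is written in
Python rather than Stata purely for illustration; the function name
edit_distance and the use of "Peter Enterprises" as the reference string
are my own assumptions, not part of any Stata command.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    # previous[j] = distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or match)
            ))
        previous = current
    return previous[-1]


# The four example names; the first one is taken as the reference (assumption).
reference = "Peter Enterprises"
names = [
    "Peter Enterprises",
    "PeterEnterprises",
    "Peter!Enterprises",
    "Geter Enterprises",
]
distances = [edit_distance(reference, name) for name in names]
```

Note that the index depends on which pair is compared: "PeterEnterprises"
and "Geter Enterprises" differ from each other by two edits, even though
each is one edit away from the reference.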
Does anyone have an idea how I could compute such an index in Stata? My
goal is to use the index as an additional variable that helps me recheck
cases which may be duplicates.
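As a rough sketch of how the index could serve as an additional variable,
the Python snippet below stores, for each name, the smallest edit distance
(number of single-character insertions, deletions or substitutions) to any
other name, and flags cases at or below a cutoff. The helper names, the
cutoff of 1 and the extra row "Acme Holdings" are hypothetical and only
there for illustration; this is not an existing Stata routine.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def nearest_distance(names):
    """For each name, the smallest edit distance to any other name in the list."""
    result = []
    for i, name in enumerate(names):
        others = [edit_distance(name, other)
                  for j, other in enumerate(names) if j != i]
        result.append(min(others) if others else None)
    return result


names = [
    "Peter Enterprises",
    "PeterEnterprises",
    "Peter!Enterprises",
    "Geter Enterprises",
    "Acme Holdings",  # hypothetical unrelated row, not in the original data
]
cutoff = 1  # hypothetical threshold for "worth rechecking by hand"
nearest = nearest_distance(names)
flags = [d is not None and d <= cutoff for d in nearest]

for name, d, flag in zip(names, nearest, flags):
    print(f"{name:<20} {d:>3} {'recheck' if flag else ''}")
```

Every name is compared with every other name, so the number of
comparisons grows quadratically with the number of observations; for a
large dataset one would probably restrict comparisons to smaller groups
first.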
Thanks for your suggestions and help.
Simon