
st: data problem - duplicates

From	<[email protected]>
To	<[email protected]>
Subject	st: data problem - duplicates
Date	Tue, 3 Jun 2008 14:51:32 +0200

Dear all,

I would like to identify (and later delete) duplicates in a dataset.
However, Stata cannot recognize some of the duplicates, because some
variables in my dataset are of poor data quality. The duplicates have to
be identified from a string variable, "name".

Simplified, my dataset looks like this:

Name                 var1   var2

Peter Enterprises       1      2
PeterEnterprises        1      2
Peter!Enterprises       1      2
Geter Enterprises       1      2


"Name" is the only variable I can use to identify duplicates. I know
that there are methods and programs that compute a kind of
"similarity index", which measures how similar two or more strings are
by counting the characters that differ between them.

In my example, each of the last three cases has a "similarity index" of
1 relative to the first, because only one character has to be changed,
added, or removed to make them equal.
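
The index described here corresponds closely to what is usually called
the Levenshtein (edit) distance: the minimum number of single-character
insertions, deletions, or substitutions needed to turn one string into
another. The following is a minimal sketch of that calculation in Python,
not Stata code; the function name edit_distance is an assumption made
for illustration, and the logic would have to be ported to Stata or Mata.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The four names from the example, each compared with the first one.
names = ["Peter Enterprises", "PeterEnterprises",
         "Peter!Enterprises", "Geter Enterprises"]
distances = [edit_distance(names[0], n) for n in names]
print(distances)  # [0, 1, 1, 1]
```

Note that the distance is 1 only relative to "Peter Enterprises"; for
example, "PeterEnterprises" and "Geter Enterprises" are two edits apart.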

Does anyone have an idea how I could compute such an index in Stata? My
goal is to use the index as an additional variable that helps me recheck
cases that may contain potential duplicates.
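
One way such an additional variable could look is sketched below in
Python (again as an illustration of the logic, not Stata code): for each
record, store the smallest edit distance to any other name, plus a flag
when that distance is at or below a threshold. The function names, the
threshold of 1, and the extra row "Miller Ltd" are assumptions added for
the example and are not part of the dataset described above.

```python
def edit_distance(a, b):
    # Compact Levenshtein distance: insert, delete, substitute each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def flag_near_duplicates(names, threshold=1):
    """For each name, return (smallest distance to any other name, flag)."""
    result = []
    for i, name in enumerate(names):
        others = [edit_distance(name, other)
                  for j, other in enumerate(names) if j != i]
        nearest = min(others) if others else None
        flagged = nearest is not None and nearest <= threshold
        result.append((nearest, flagged))
    return result


# Hypothetical data: the four example names plus one unrelated row.
names = ["Peter Enterprises", "PeterEnterprises",
         "Peter!Enterprises", "Geter Enterprises", "Miller Ltd"]
flags = flag_near_duplicates(names)
for name, (nearest, flagged) in zip(names, flags):
    print(name, nearest, flagged)
```

Comparing every name with every other name grows quadratically with the
number of records, so for a large dataset it would probably be necessary
to restrict comparisons to subgroups of records.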

Thanks for your suggestions and help.
Simon


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: data problem - duplicates
  - From: Phil Schumm <[email protected]>
- Re: st: data problem - duplicates
  - From: "Salah Mahmud" <[email protected]>
