Re: st: data problem - duplicates

From	"Salah Mahmud" <[email protected]>
To	[email protected]
Subject	Re: st: data problem - duplicates
Date	Tue, 3 Jun 2008 08:29:17 -0500

The egen soundex function might be a good starting point.
For an example, see http://www.stata.com/statalist/archive/2002-11/msg00480.html
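
The "similarity index" described in the question quoted below, the number of single-character changes needed to turn one string into another, is commonly called the edit (Levenshtein) distance. As a rough illustration of the idea only, here is a minimal sketch in Python rather than Stata; the function name edit_distance and the variable names are hypothetical, and this is not the approach used in the linked post.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# The four example names from the question.
names = [
    "Peter Enterprises",
    "PeterEnterprises",
    "Peter!Enterprises",
    "Geter Enterprises",
]

# Distance of every name from the first one, used here as the reference.
reference = names[0]
distances = {name: edit_distance(reference, name) for name in names}

for name, d in distances.items():
    print(f"{name:20s} {d}")
```

Each variant is one edit away from the first name, but the variants are not all one edit away from each other: "PeterEnterprises" and "Geter Enterprises" differ by two edits. An index like this is therefore best computed pairwise, or against a chosen reference spelling, and used only to flag candidates for manual checking.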

salah mahmud

On Tue, Jun 3, 2008 at 7:51 AM,  <[email protected]> wrote:
> Dear all,
>
> I would like to select (and later delete) duplicates from a dataset.
> However, some duplicates cannot be recognized by Stata, because some
> variables in my dataset have poor data quality. The analysis of the
> duplicates is based on a string variable "name".
>
> Simplified, my dataset looks like this:
>
> Name                 var1   var2
>
> Peter Enterprises       1      2
> PeterEnterprises        1      2
> Peter!Enterprises       1      2
> Geter Enterprises       1      2
>
>
> "Name" is the only variable that I can use to select duplicates. I know
> that there are methods and programs that can compute a kind of
> "similarity index", which measures how similar two or more values are
> by counting the characters that differ between them.
>
> In my example, this means that each of the last three cases above has a
> "similarity index" of 1 relative to the first, because only one character
> has to be changed to make them equal.
>
> Does anyone have an idea how I could define such an index in Stata? My goal
> is to use the index as an additional variable that helps me recheck cases
> that may include duplicates.
>
> Thanks for your suggestions and help.
> Simon
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
