The egen soundex function might be a good starting point.
For an example, see http://www.stata.com/statalist/archive/2002-11/msg00480.html
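
To make the suggestion concrete, here is a minimal sketch of what a Soundex key does. It is written in Python only to show the idea and is not Stata's implementation: the function name `soundex`, the choice to drop spaces and punctuation before coding, and the four-character key length are assumptions made for this example, so results from Stata's own function may differ.

```python
# Simplified American Soundex, for illustration only (not Stata's code).
# Assumption: anything that is not a letter A-Z is dropped before coding.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex key, or "" if name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


names = ["Peter Enterprises", "PeterEnterprises",
         "Peter!Enterprises", "Geter Enterprises"]
keys = {n: soundex(n) for n in names}
```

Under these assumptions the first three names from the question share one key, while "Geter Enterprises" gets a different key because Soundex always keeps the first letter. A phonetic key therefore helps with spacing and punctuation noise but not with a typo in the first character.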
Salah Mahmud
On Tue, Jun 3, 2008 at 7:51 AM, Simon wrote:
> Dear all,
>
> I would like to select (and later delete) duplicates from a dataset.
> However, some duplicates cannot be recognized by Stata, because some
> variables in my dataset have poor data quality. The analysis of the
> duplicates is based on a string variable "name".
>
> Simplified, my dataset looks like this:
>
> Name var1 var2
>
> Peter Enterprises 1 2
> PeterEnterprises 1 2
> Peter!Enterprises 1 2
> Geter Enterprises 1 2
>
>
> "Name" is the only variable that I can use to select duplicates. I know
> that there are methods and programs that can define a kind of
> "similarity index", which measures how similar two or more strings are
> by counting the characters that differ between them.
>
> Concerning my example, this means that each of the four cases above has a
> "similarity index" of 1, because only one letter or character has to be
> changed to make them equal.
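
The index described here is close to what is usually called the Levenshtein (edit) distance: the smallest number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below is in Python, not Stata, and only illustrates the computation; the function name `levenshtein` is chosen for this example.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the prefix of a seen so far
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


reference = "Peter Enterprises"
others = ["PeterEnterprises", "Peter!Enterprises", "Geter Enterprises"]
distances = [levenshtein(reference, name) for name in others]
```

Measured against "Peter Enterprises", each of the other three names is at distance 1. The distance is pairwise, though: "PeterEnterprises" and "Geter Enterprises" are at distance 2 from each other, so an index stored as one variable needs a stated reference string or a rule for which pairs are compared.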
>
> Does anyone have an idea how I could define such an index in Stata? My goal
> is to use such an index as an additional variable, which would help me
> recheck cases that include potential duplicates.
>
> Thanks for your suggestions and help.
> Simon
>
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>