Re: st: identifying strings that differ on one or two letters

From	"Dimitriy V. Masterov" <[email protected]>
To	[email protected]
Subject	Re: st: identifying strings that differ on one or two letters
Date	Fri, 19 Nov 2010 09:25:31 -0500

I don't know if this will accomplish exactly what you want, but Julian
Reif's strgroup might work. It's available from SSC (-ssc install
strgroup-). For example,

strgroup comp_name, gen(match) threshold(.25)

will produce a variable called match that gives the same group number
to strings judged close enough under the threshold:

comp_name	match
Jayanthi chemicals	1
Jayanth chemicals	1
Jay chemicals	2
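
To make the grouping logic concrete, here is a rough sketch in Python (not Stata, and not strgroup's actual code). It assumes that two strings match when their Levenshtein distance divided by the length of the shorter string is at most the threshold, and that matches are chained into groups. Both points are assumptions about strgroup's behavior, so check -help strgroup- before relying on them. The function names are made up for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def group_strings(names, threshold):
    """Assign group numbers; matching pairs are merged with union-find.

    Assumption: the distance is normalized by the shorter string's length
    and compared with <=. This mimics, but is not, strgroup.
    """
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        shorter = min(len(names[i]), len(names[j]))
        # Pairs involving an empty string are skipped to avoid dividing by zero.
        if shorter and edit_distance(names[i], names[j]) / shorter <= threshold:
            parent[find(j)] = find(i)

    # Number the groups 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(names))]


names = ["Jayanthi chemicals", "Jayanth chemicals", "Jay chemicals"]
print(group_strings(names, 0.25))  # [1, 1, 2]
```

Under these assumptions the first pair scores 1/17 (about .06) and is grouped, while "Jay chemicals" is 4/13 (about .31) from its nearest neighbor and stays on its own.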

If you just want pairwise comparisons of strings, you can use the
levenshtein command that comes bundled with strgroup. For example,
-levenshtein "Jayanthi chemicals" "Jayanth chemicals"- will evaluate
to 1, since a single edit (deleting the "i") turns the first string
into the second.
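
For anyone unsure what the Levenshtein distance counts, here is a small Python sketch of the textbook recursive definition. It only illustrates the measure and is not the code behind the levenshtein command. The recursion is fine for short strings like company names.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Fewest single-character insertions, deletions, or substitutions
    needed to turn a into b."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # insert the j characters of b[:j]
        if j == 0:
            return i  # delete the i characters of a[:i]
        return min(
            d(i - 1, j) + 1,                           # delete a[i-1]
            d(i, j - 1) + 1,                           # insert b[j-1]
            d(i - 1, j - 1) + (a[i - 1] != b[j - 1]),  # substitute if different
        )

    return d(len(a), len(b))


print(levenshtein("Jayanthi chemicals", "Jayanth chemicals"))  # 1
print(levenshtein("Jayanth chemicals", "Jay chemicals"))       # 4
```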

If you want to do a fuzzy merge using comp_name, then Michael
Blasnik's nearmrg (also at SSC) is your friend.
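
As a rough picture of what a fuzzy merge on names involves, the Python sketch below pairs each name in one list with its most similar name in another, provided the similarity clears a cutoff. It is not nearmrg's algorithm: it uses the similarity ratio from Python's standard difflib module as a stand-in, and the function name, the 0.9 cutoff, and both name lists are hypothetical.

```python
from difflib import get_close_matches


def fuzzy_match(left_names, right_names, cutoff=0.9):
    """Map each left name to its most similar right name,
    or to None when no candidate reaches the cutoff."""
    matches = {}
    for name in left_names:
        best = get_close_matches(name, right_names, n=1, cutoff=cutoff)
        matches[name] = best[0] if best else None
    return matches


master = ["Jayanth chemicals", "Jay chemicals"]      # hypothetical master names
using = ["Jayanthi chemicals", "Acme traders"]       # hypothetical using names
print(fuzzy_match(master, using))
```

Names left unmatched (None) are the ones to check by hand or to retry with a looser cutoff.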

HTH,
DVM

On Fri, Nov 19, 2010 at 7:59 AM, Dalhia <[email protected]> wrote:
> Hello,
>
> Is there a method in Stata to identify strings that differ by just one or two letters?
> For example:
>
> comp_name
>
> Jayanthi chemicals
> Jayanth chemicals
> Jay chemicals
>
> So here the first two should be identified, since they differ by only one letter, but not the last one, since it differs by 4 letters. Is there a way to do this in Stata?
>
> thanks. I appreciate your help.
> dalhia
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

