Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


st: RE: RE: Re: comparing strings off by one character

From	"Brill, Robert" <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	st: RE: RE: Re: comparing strings off by one character
Date	Wed, 12 Mar 2014 21:36:05 +0000

I've used -strgroup- for a similar project with very good results. The ability to select a Levenshtein threshold is very useful. This is a very large dataset, however, and -strgroup- may have difficulty with it (pairwise comparisons among roughly 100,000 names are a lot), but there are certainly options to subset the data.

Using -strgroup- and then -duplicates tag- on year and group should leave very few cases that need to be fixed by hand, especially because each company should (if I understand correctly) have only one observation for each time period.
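
A minimal sketch of the -duplicates tag- step, assuming a toy dataset and a hand-made group variable standing in for the output of -strgroup-. The -strgroup- call itself is left as a comment: it is a user-written command, and the option names and threshold value shown are assumptions to check against its help file. The variable names year, group and dup are illustrative.

```stata
clear
input str30 name int year
"ABRAXIS BIOSCIENCE INC"  2005
"ABRAXIS BIOSCIENCES INC" 2006
"ABRAXIS BIOSCIENCE INC"  2006
"ZETA CORP"               2005
end

* In the real data the group id would come from -strgroup-, for example
* (assumed syntax and threshold, not verified here):
*   strgroup name, generate(group) threshold(0.1)
* Stand-in grouping so the sketch runs without extra installs:
gen byte group = cond(strpos(name, "ABRAXIS") > 0, 1, 2)

* Flag company-years that occur more than once within a group
duplicates tag year group, generate(dup)

* Only these rows would need a look by hand
list name year group dup if dup > 0
```

Here dup is 0 for company-years that are unique within their group and positive where two spellings fall in the same group and year, which are the cases left for manual review.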

Best,

Rob Brill
Child and Family Research Partnership
The University of Texas at Austin

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
Sent: Wednesday, March 12, 2014 3:58 PM
To: [email protected]
Subject: st: RE: Re: comparing strings off by one character

Maria,

I don't know much about -strgroup-, but it looked interesting so I tried to learn more....

It looks like -strgroup- can group observations based on their Levenshtein distance (given a certain threshold) and assign each set of matches a unique number.  I wonder why you couldn't just use that number to identify companies from here on, instead of having to fix the names so that they match?
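
A rough sketch of that idea, assuming a numeric group variable has already been created (here it is typed in by hand rather than produced by -strgroup-). The group number serves as the company key; the optional name_canon label, which picks the most frequent spelling within each group, is a hypothetical addition for readability and not something proposed in the thread.

```stata
clear
input str30 name byte group
"ABRAXIS BIOSCIENCE INC"  1
"ABRAXIS BIOSCIENCES INC" 1
"ABRAXIS BIOSCIENCE INC"  1
"ZETA CORP"               2
end

* Count how often each spelling occurs within its group
bysort group name: gen long n_spelling = _N

* Sort each group so the most frequent spelling is last, then copy it to all rows
bysort group (n_spelling name): gen str30 name_canon = name[_N]

list group name name_canon
```

With this layout the original name variable is left untouched, so any grouping that turns out to be wrong can still be undone later.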

Regards,
Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maria Boutchkova
Sent: Wednesday, March 12, 2014 3:39 PM
To: statalist
Subject: st: Re: comparing strings off by one character

Dear Statalisters,

I am dealing with the following problem under the general rubric of string comparison.
I have a string variable, name, containing company names over 17 years.
In some years the data entry person made a typo, so the company name differs by one character from the way it is entered in the other years. Initial collapsing by name results in close to 100K unique names, so automation is a must.
This post has been very helpful to me so far, but I am not quite there yet.
http://www.stata.com/statalist/archive/2012-03/msg01135.html

Here is what I have so far:

collapse (first) v1 v2 (max) v3, by(name)
sort name
gen name_prev = name[_n-1] if _n > 1
order name name_prev
levenshtein name name_prev, gen(levstein_prev)

After examining the results up to here, I see that whenever the differing character is a digit, the names are genuinely distinct and I should not correct them. Therefore I was thinking of using

gen char_off_place = indexnot(name,name_prev) if levstein_prev == 1

and then conditioning my further commands on whether the character off is a number or a letter.

The problem is that indexnot(name,name_prev) doesn't do exactly what I want.
For example:
name is "ABRAXIS BIOSCIENCES INC"
name_prev is "ABRAXIS BIOSCIENCE INC"

In this case, indexnot(string1,string2) will return 0 because the off character (3rd "S" in string1) appears elsewhere in string2; indexnot() only checks whether each character occurs anywhere in string2, not at which position.

It seems like there must be a way to get the position of the off character while observing the order of the characters in string2. Before I give up on Stata and do it in MATLAB, can anyone offer a suggestion?

(If there are cases where the first letter of the company name is off, I will deal with this easily later.)
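
One possible way to get an order-aware position is to compare the two strings character by character with substr(). The sketch below is only an illustration under stated assumptions: it uses two made-up rows, omits the levstein_prev == 1 condition (which would be added to the replace and gen lines in the real data), and introduces the hypothetical helper variables maxlen, c_new, c_old, longer and off_is_digit.

```stata
clear
input str30 name str30 name_prev
"ABRAXIS BIOSCIENCES INC" "ABRAXIS BIOSCIENCE INC"
"ACME FUND 2"             "ACME FUND 1"
end

* The longest string in either variable bounds the loop
gen int maxlen = max(strlen(name), strlen(name_prev))
summarize maxlen, meanonly
local L = r(max)

* Scan right to left so the leftmost mismatch is the last value written
gen int char_off_place = 0
forvalues i = `L'(-1)1 {
    replace char_off_place = `i' if substr(name, `i', 1) != substr(name_prev, `i', 1)
}

* Characters found at the mismatch position in each string
gen str1 c_new = substr(name, char_off_place, 1)
gen str1 c_old = substr(name_prev, char_off_place, 1)

* The extra character sits in the longer string; with equal lengths check both
gen byte longer = sign(strlen(name) - strlen(name_prev))
gen byte off_is_digit = (longer >= 0 & regexm(c_new, "^[0-9]$")) | (longer <= 0 & regexm(c_old, "^[0-9]$"))

list name name_prev char_off_place c_new c_old off_is_digit
```

For the ABRAXIS pair this gives position 19, where name has "S" and name_prev has a space, and the character is not flagged as a digit. A value of 0 in char_off_place means the two strings are identical.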

Thank you!
Maria Boutchkova
Lecturer in Finance
University of Edinburgh
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


Follow-Ups:
- st: RE: RE: RE: Re: comparing strings off by one character
  - From: Colin Hargreaves <[email protected]>

References:
- st: Re: comparing strings off by one character
  - From: Maria Boutchkova <[email protected]>
- st: RE: Re: comparing strings off by one character
  - From: Joe Canner <[email protected]>
