Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
st: RE: Re: comparing strings off by one character
From: Joe Canner <[email protected]>
To: "[email protected]" <[email protected]>
Subject: st: RE: Re: comparing strings off by one character
Date: Wed, 12 Mar 2014 20:57:45 +0000
Maria,
I don't know much about -strgroup-, but it looked interesting so I tried to learn more....
It looks like -strgroup- can group observations based on their Levenshtein distance (given a certain threshold) and assign each set of matches a unique number. I wonder why you couldn't just use that number to identify companies from here on, instead of having to fix the names so that they match?
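A minimal sketch of that idea, assuming a numeric group identifier is already available. Here the variable `group` is typed in by hand on toy data and only stands in for whatever -strgroup- would generate; `group` and `name_std` are hypothetical names, and the -strgroup- call itself is deliberately not shown.

```stata
* Toy list of unique names; "group" is entered by hand and stands in
* for the identifier a grouping command would produce (assumption).
clear
input str30 name int group
"ABRAXIS BIOSCIENCE INC" 1
"ABRAXIS BIOSCIENCES INC" 1
"ACME 2 INC" 2
"ACME 3 INC" 3
end

* Identify companies by group from here on; keep one spelling per group
* (the alphabetically first one) purely for display.
bysort group (name): gen str30 name_std = name[1]
list, sepby(group)
```

The original spellings stay untouched in `name`, so nothing has to be corrected by hand; `group` serves as the company identifier.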
Regards,
Joe Canner
Johns Hopkins University School of Medicine
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maria Boutchkova
Sent: Wednesday, March 12, 2014 3:39 PM
To: statalist
Subject: st: Re: comparing strings off by one character
Dear Statalisters,
I am dealing with the following problem under the general rubric of
string comparison.
I have a string variable, name, containing company names over 17 years.
In some years the data entry person has made a typo, so the company
name differs by one character from the way it is entered in the other
years. An initial collapse by name yields close to 100K unique names,
so automation is a must.
This post has been very helpful to me so far, but I am not quite there yet.
http://www.stata.com/statalist/archive/2012-03/msg01135.html
Here is what I have so far:
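* One row per distinct company name
collapse (first) v1 v2 (max) v3, by(name)
sort name
* Pair each name with its alphabetical predecessor
gen name_prev = name[_n-1] if _n > 1
order name name_prev
* levenshtein is a user-written command; it stores the edit distance in levstein_prev
levenshtein name name_prev, gen(levstein_prev)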
After examining the results up to here, I see that whenever the
differing character is a digit, the names are genuinely distinct and
I should not correct them. Therefore I was thinking of using
gen char_off_place = indexnot(name,name_prev) if levstein_prev == 1
and then conditioning my further commands on whether the character off
is a number or a letter.
The problem is that indexnot(name,name_prev) doesn't do exactly what I want.
For example:
name is "ABRAXIS BIOSCIENCES INC"
name_prev is "ABRAXIS BIOSCIENCE INC"
In this case, indexnot(string1,string2) will return 0 because the off
character (3rd "S" in string1) also appears elsewhere in string2.
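The behaviour described above can be checked directly with -display-. The first call uses the pair from the example; the second pair ("ACME 2 INC" / "ACME 3 INC") is made up for illustration.

```stata
* indexnot(s1, s2) gives the position in s1 of the first character of s1
* that occurs nowhere in s2, or 0 if every character of s1 occurs in s2.

* Every character of the first string occurs somewhere in the second: shows 0
display indexnot("ABRAXIS BIOSCIENCES INC", "ABRAXIS BIOSCIENCE INC")

* "2" occurs nowhere in the second string and sits at position 6: shows 6
display indexnot("ACME 2 INC", "ACME 3 INC")
```

So indexnot() tests set membership of characters and ignores their order, which is why it misses an inserted or dropped letter that is already present in the other string.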
It seems like there must be a way to get the position of the off
character while respecting the order of the characters in string2.
Before I give up on Stata and do it in MATLAB, can anyone offer a
suggestion?
(If there are cases where the first letter of the company name is off,
I will deal with this easily later.)
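One possible way to get the position asked about, sketched on toy data: compare the two strings position by position with substr() and keep the leftmost mismatch. When the edit distance is 1, the character at that position in the longer string is the inserted one (or, for a substitution, the two characters there are the ones that differ). The names `maxlen_tmp`, `char_a`, `char_b` and `off_is_digit` are hypothetical, the "ACME" and "DELTA" rows are invented, and `levstein_prev` is typed in by hand here rather than computed.

```stata
* Toy data; levstein_prev is entered by hand for illustration (assumption).
clear
input str30 name str30 name_prev byte levstein_prev
"ABRAXIS BIOSCIENCES INC" "ABRAXIS BIOSCIENCE INC" 1
"ACME 2 INC" "ACME 3 INC" 1
"DELTA CO" "DELTA CORP" 2
end

* Longest string length determines how many positions to check.
gen int maxlen_tmp = max(length(name), length(name_prev))
summarize maxlen_tmp, meanonly
local maxlen = r(max)

* Walk from right to left so the leftmost mismatch is the one stored last.
gen int char_off_place = .
forvalues i = `maxlen'(-1)1 {
    quietly replace char_off_place = `i' ///
        if levstein_prev == 1 & substr(name, `i', 1) != substr(name_prev, `i', 1)
}

* Characters at that position ("" when past the end of the shorter string).
gen str1 char_a = substr(name, char_off_place, 1) if levstein_prev == 1
gen str1 char_b = substr(name_prev, char_off_place, 1) if levstein_prev == 1

* Flag pairs in which either off character is a digit.
gen byte off_is_digit = (inrange(char_a, "0", "9") | inrange(char_b, "0", "9")) ///
    if levstein_prev == 1

drop maxlen_tmp
list name char_off_place char_a char_b off_is_digit
```

On the toy rows this gives position 19 for the ABRAXIS pair (an "S" against a blank) and position 6 for the ACME pair, which is flagged as a digit difference. A mismatch in the first character is handled the same way and comes out as position 1.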
Thank you!
Maria Boutchkova
Lecturer in Finance
University of Edinburgh
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/