st: String search
Dear Statalist,
I have a string variable that contains values like these:
"outside the red lion pub"
"red lion"
"in the red lyon"
and so on.
I need to search this variable for names (e.g. "red lion") and would
like to do so in a way that copes with the inevitable typos (e.g. "red
lyon").
Searching the Statalist archives, I came across this, for example:
g pub = 0
replace pub = 1 if index(lower(var1), "red lion")
But this does not cope well if there's any deviation in spelling. I
also came across a rather neat routine written by Laura Giuliano that
computes the Levenshtein distance and goes something like this:
local word1 = "simon"
local word2 = "slim"
local L1 = length("`word1'")
local L2 = length("`word2'")
* A[j+1, i+1] holds the distance between the first j characters
* of word2 and the first i characters of word1
matrix A = J(`L2'+1, `L1'+1, 0)
forval i = 0 / `L1' {
    matrix A[1, `i'+1] = `i'
}
forval j = 1 / `L2' {
    matrix A[`j'+1, 1] = `j'
}
forval j = 1 / `L2' {
    forval i = 1 / `L1' {
        * substitution costs 0 if the characters match, 1 otherwise
        if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
            local cost = 0
        }
        else {
            local cost = 1
        }
        * cheapest of the two single-character gap moves (m, n)
        * and the substitution move (d)
        local m = 1 + A[`j', `i'+1]
        local n = 1 + A[`j'+1, `i']
        local d = `cost' + A[`j', `i']
        matrix A[`j'+1, `i'+1] = min(`m', `n', `d')
    }
}
local lev = A[`L2'+1, `L1'+1]
di "Levenshtein distance between `word1' and `word2' is `lev' "
This would be great, except that my string variable often has the odd
additional, and redundant, word thrown in.
So, would anyone happen to know of a routine that combines index and
Levenshtein to give some measure of whether the text is definitely, or
almost definitely, in the string variable? For example, a score of 0 if
"red lion" is present, 1 if "red lyon" is present, and so on.
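One way such a combined score might be sketched is to adapt the Levenshtein routine above into an approximate substring match: the first row of the matrix is left at 0, so that a match may start anywhere in the longer string, and the score is the minimum of the last row, so that it may end anywhere. This is only an illustration under those assumptions, not an existing Stata routine; the macro names text, pat, LT, LP and score are hypothetical, and the code handles one pair of strings rather than a whole variable.

```stata
* Sketch only: smallest edit distance between pat and any substring of text
local text = "in the red lyon"
local pat  = "red lion"
local LT = length("`text'")
local LP = length("`pat'")
* Row 1 stays at 0: an empty pattern matches at any position in text
matrix A = J(`LP'+1, `LT'+1, 0)
forval j = 1 / `LP' {
    matrix A[`j'+1, 1] = `j'
}
forval j = 1 / `LP' {
    forval i = 1 / `LT' {
        * cost is 0 if the characters match, 1 otherwise
        local cost = substr("`pat'", `j', 1) != substr("`text'", `i', 1)
        local m = 1 + A[`j', `i'+1]
        local n = 1 + A[`j'+1, `i']
        local d = `cost' + A[`j', `i']
        matrix A[`j'+1, `i'+1] = min(`m', `n', `d')
    }
}
* The match may end anywhere in text, so take the minimum of the last row
local score = `LP'
forval i = 0 / `LT' {
    local score = min(`score', A[`LP'+1, `i'+1])
}
di "Best approximate match of `pat' in `text' has distance `score'"
```

With the values shown, the score is 1 ("lyon" differs from "lion" by one substitution); a string containing "red lion" exactly would score 0, however many extra words surround it. Applying this to every observation of a variable would need an outer loop over observations, which is omitted here.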
As ever, any guidance greatly appreciated.
Regards
Simon