Try regular-expression matching with regexm(), for example:
. list
     +--------------------------+
     |                    myvar |
     |--------------------------|
  1. | outside the red lion pub |
  2. |                 red lion |
  3. |          in the red lyon |
  4. |                 Red Lion |
  5. |                 red Lyon |
     |--------------------------|
  6. |                 red loon |
     +--------------------------+
. gen found = regexm(myvar, "[rR]ed [lL][iy]?on")
. list
     +----------------------------------+
     |                    myvar   found |
     |----------------------------------|
  1. | outside the red lion pub       1 |
  2. |                 red lion       1 |
  3. |          in the red lyon       1 |
  4. |                 Red Lion       1 |
  5. |                 red Lyon       1 |
     |----------------------------------|
  6. |                 red loon       0 |
     +----------------------------------+
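
If you want a graded score rather than a 0/1 flag, as the original question suggests (0 if "red lion" is present, 1 if only a near match such as "red lyon" is present), something along these lines might do. This is only a sketch: the variable names lowvar and score are made up for illustration, the data are the six example strings from the listing above, and the pattern only tolerates the misspellings you anticipate, so it is not a true edit-distance measure.

```stata
* Illustrative data; lowvar and score are hypothetical names
clear
input str24 myvar
"outside the red lion pub"
"red lion"
"in the red lyon"
"Red Lion"
"red Lyon"
"red loon"
end

* Work on a lower-cased copy so the patterns need not handle case
gen lowvar = lower(myvar)

* 0 = exact name present, 1 = near match allowed by the pattern, . = no match
gen score = .
replace score = 1 if regexm(lowvar, "red l[iy]on")
replace score = 0 if strpos(lowvar, "red lion") > 0

list myvar score
```

The near-match rule is applied first and the exact rule second, so an exact hit always ends up with the lower score.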
On Wed, Aug 13, 2008 at 9:17 AM, Simon Moore <[email protected]> wrote:
> Dear Statalist,
>
> I have a string variable that contains values something like this:-
>
> "outside the red lion pub"
> "red lion"
> "in the red lyon"
>
> and so on.
>
> I need to search this variable for names (e.g. "red lion") and would like to
> do so in a way that copes with the inevitable typos (e.g. "red lyon").
>
> Searching through the statalist archives I have come across, for example:
>
> g pub = 0
> replace pub = 1 if index(lower(var1), "red lion")
>
> But this does not cope well if there's any deviation in spelling. I also
> came across a rather neat routine written by Laura Giuliano that computes
> the Levenshtein distance and goes something like this:
>
> local word1 = "simon"
> local word2 = "slim"
> local L1 = length("`word1'")
> local L2 = length("`word2'")
>
> matrix A=J(`L2'+1, `L1'+1, 0)
> forval i = 0 / `L1' {
>     matrix A[1,`i'+1] = `i'
> }
> forval j = 1 / `L2' {
>     matrix A[`j'+1,1] = `j'
> }
> forval j = 1 / `L2' {
>     forval i = 1 / `L1' {
>         if substr("`word2'", `j', 1) == substr("`word1'", `i', 1) {
>             local cost=0
>         }
>         else {
>             local cost=1
>         }
>         local m = 1 + A[`j', `i'+1]
>         local n = 1 + A[`j'+1, `i']
>         local d = `cost' + A[`j', `i']
>         matrix A[`j'+1,`i'+1]=min(`m',`n',`d')
>     }
> }
> local lev = A[`L2'+1, `L1'+1]
> di "Levenshtein distance between `word1' and `word2' is `lev' "
>
>
> This would be great, except that my string variable has the odd additional,
> and redundant, word thrown in.
>
> So, would anyone happen to know if there's a routine that combines both
> index and Levenshtein to provide some measure of whether the text is
> definitely or nearly definitely in the string variable? For example, a score
> of 0 if "red lion" is present, 1 if "red lyon" is present and so on.
>
> As ever, any guidance greatly appreciated.
>
> Regards
> Simon
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/