Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: RE: Stata analog to Mata's -strdup()- or better approach? |
Date | Sat, 12 Mar 2011 12:02:40 +0000 |
You're correct. The code below fixes the error I found on closer
examination. The incorrect line used a replacement mask that was
n_longest long; it should have been l_longest long. Thanks for checking.

clear
input patid str12 estring
1 XXXXX-------
2 --XXX---XXXX
3 -XXXXXX-----
4 -XXX-XXX-XXX
5 XXXX-XX-XXXX
end

gen X = ""
gen l_longest = 0
gen s_longest = ""
gen where1 = 0

qui forval i = 1/12 {
    replace X = X + "X"
    replace s_longest = X if strpos(estring, X)
    replace l_longest = `i' if strpos(estring, X)
    replace where1 = strpos(estring, X) if strpos(estring, X)
}

drop X
gen n_longest = ///
(length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
length(s_longest)

clonevar copy = estring
local mask : di _dup(12) "&"
local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
replace copy = `rtext'

su n_longest, meanonly
forval j = 2/`r(max)' {
    gen where`j' = strpos(copy, s_longest)
    replace copy = `rtext'
}

On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <picard@netbox.com> wrote:
> Nick, I think that there's a problem with your code, it does not work
> with a string like:
>
> "XXXX-XX-XXXX"
>
> Here's how I would do it:
>
> * -------------------------- begin example ----------------
>
> clear
> input patid str12 estring
> 1 XXXXX-------
> 2 --XXX---XXXX
> 3 -XXXXXX-----
> 4 -XXX-XXX-XXX
> 5 XXXX-XX-XXXX
> 6 X-XX-XX-XXXX
> 6 X-XXXXX-XXXX
> end
>
> * Find the longest period of continuous eligibility
> clonevar es = estring
> gen maxspan = ""
> local more 1
> while `more' {
>     gen s = regexs(1) if regexm(es,"(X+)")
>     replace maxspan = s if length(s) > length(maxspan)
>     replace es = subinstr(es,s,"",1)
>     count if s != ""
>     local more = r(N)
>     drop s
> }
>
>
> * Identify the start of each span
> gen smask = subinstr(maxspan,"X","_",.)
> replace es = estring
> local i 0
> local more 1
> while `more' {
>     local i = `i' + 1
>     gen where`i' = strpos(es,maxspan)
>     replace where`i' = . if where`i' == 0
>     replace es = subinstr(es,maxspan,smask,1)
>     count if where`i' != .
>     local more = r(N)
> }
> drop where`i'
> egen nmaxspan = rownonmiss(where*)
> drop es smask
>
> * -------------------------- end example ------------------
>
>
>
> On Sat, Mar 12, 2011 at 11:23 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> First, let me give a more complete example of how I would approach
>> your problem.
>>
>> 1. Your example data.
>>
>> clear
>> input patid str12 estring
>> 1 XXXXX-------
>> 2 --XXX---XXXX
>> 3 -XXXXXX-----
>> 4 -XXX-XXX-XXX
>> end
>>
>> 2. Sample script starts with initialisations. Clearly, 12 is specific
>> to the example.
>>
>> gen X = ""
>> gen l_longest = 0
>> gen s_longest = ""
>> gen where1 = 0
>>
>> 3. The main loop just tries out longer multiples of "X" until it finds
>> the longest.
>>
>> qui forval i = 1/12 {
>>     replace X = X + "X"
>>     replace s_longest = X if strpos(estring, X)
>>     replace l_longest = `i' if strpos(estring, X)
>>     replace where1 = strpos(estring, X) if strpos(estring, X)
>> }
>>
>> drop X
>>
>> 4. The number of times the longest substring occurs is calculated from
>> a comparison of length before and after (notionally) blanking it out.
>> There is more on this trick at Mitch Abdon's blog
>>
>> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>>
>> and in my Speaking Stata column in SJ 11(1) 2011.
>>
>> gen n_longest = ///
>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>> length(s_longest)
>>
>> 5. Now to find the separate occurrences of the longest substring we
>> look for each one in a copy, and every time we do find one we
>> replace it with a mask of the same length. "&" is arbitrary here.
>>
>> clonevar copy = estring
>> local mask : di _dup(12) "&"
>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>> replace copy = `rtext'
>>
>> su n_longest, meanonly
>> forval j = 2/`r(max)' {
>>     gen where`j' = strpos(copy, s_longest)
>>     replace copy = `rtext'
>> }
>>
>> Part of the Stata magic is that what the longest substring is, how
>> many times it occurs, and its length can easily vary from observation
>> to observation.
>>
>> Here is all the code as one segment
>>
>> clear
>> input patid str12 estring
>> 1 XXXXX-------
>> 2 --XXX---XXXX
>> 3 -XXXXXX-----
>> 4 -XXX-XXX-XXX
>> end
>>
>> gen X = ""
>> gen l_longest = 0
>> gen s_longest = ""
>> gen where1 = 0
>>
>> qui forval i = 1/12 {
>>     replace X = X + "X"
>>     replace s_longest = X if strpos(estring, X)
>>     replace l_longest = `i' if strpos(estring, X)
>>     replace where1 = strpos(estring, X) if strpos(estring, X)
>> }
>>
>> drop X
>> gen n_longest = ///
>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>> length(s_longest)
>>
>> clonevar copy = estring
>> local mask : di _dup(12) "&"
>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>> replace copy = `rtext'
>>
>> su n_longest, meanonly
>> forval j = 2/`r(max)' {
>>     gen where`j' = strpos(copy, s_longest)
>>     replace copy = `rtext'
>> }
>>
>> Now: commenting on -split-. The approach above seems closer to what
>> you want than using -split-.
>>
>> -split- treats multiple spaces as one, but otherwise does not treat
>> multiple occurrences of other delimiters as equivalent to one
>> occurrence. That is why I wrote
>>
>> replace fstring = subinstr(fstring, "-", " ", .)
>>
>> You will find that
>>
>> split estring, parse(-)
>>
>> creates rather too many variables to be useful.
>>
>> Nick
>>
>> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <rebecca.a.pope@gmail.com> wrote:
>>> Nick,
>>> I had to read what you wrote a couple of times before the "Duh" kicked
>>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>>> below. The real difference, which I didn't catch at first, is that you
>>> don't condense the spaces into a single space like I did. -split- will
>>> create a new variable for each " ", thereby preserving where the
>>> string started. For subsequent instances of variables including Xs,
>>> the index on the variable generated by -split- will be off, but I
>>> could just add the length of the preceding variables. Brilliant! (you,
>>> not me)
>>>
>>> In the interest of full disclosure, I'm rather ashamed to admit that I
>>> initially used -split- exactly as you do and cursed at it for not
>>> recognizing multiple delimiters as one, went back and condensed the
>>> multiple spaces to a single space, and then -split- the variable
>>> again. In fact, my initial reaction to your e-mail was "Did that;
>>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>>
>>> Unless I'm missing something else, I could just use a - split estring,
>>> parse(-) -, correct?
>>>
>>> Thanks again for all the help,
>>> Rebecca
>>>
>>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>
>>>> Have you thought of something like
>>>>
>>>> clonevar fstring = estring
>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>> split fstring
>>>>
>>> <truncated>
>>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
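
A minimal worked example of the length-difference count described in
step 4 of the thread, offered only as a sketch. The local macro names
s and sub are illustrative and do not appear in the original posts.

```stata
* Assumption: s and sub are hypothetical local macros for illustration.
local s   "-XXX-XXX-XXX"
local sub "XXX"

* Removing every copy of sub shortens s by (count * length of sub),
* so dividing the shortfall by length(sub) recovers the count.
display (length("`s'") - length(subinstr("`s'", "`sub'", "", .))) / length("`sub'")
```

Here the string has 12 characters and 3 remain after the removal, so
the expression evaluates to (12 - 3) / 3 = 3. The count covers
non-overlapping occurrences only, which is what is wanted for runs of
"X" separated by "-".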