Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Nick Cox <njcoxstata@gmail.com>
To	statalist@hsphsun2.harvard.edu
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Sat, 12 Mar 2011 12:02:40 +0000

You're correct. The code below fixes the error I found on closer examination.

The incorrect line used a replacement mask that was n_longest characters
long; it should have been l_longest characters long.

Thanks for checking.
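
To see why the mask length matters, here is a minimal sketch using only the problem string "XXXX-XX-XXXX". The values of s_longest, l_longest and n_longest are typed in by hand for illustration, and the names wrong, right, where2_wrong and where2_right are hypothetical. A mask of n_longest characters shortens the copy, so the second occurrence is reported at the wrong position. In the original four-observation example the slip stayed hidden, because the only string with repeated longest runs ("-XXX-XXX-XXX") happens to have n_longest equal to l_longest (both 3).

```stata
* Sketch only: values are entered by hand, not computed
clear
input str12 estring
"XXXX-XX-XXXX"
end

* longest run of X, its length, and how often it occurs
gen s_longest = "XXXX"
gen l_longest = 4
gen n_longest = 2

local mask : di _dup(12) "&"

* mask of n_longest (2) characters: the copy shrinks by 2 characters
gen wrong = subinstr(estring, s_longest, substr("`mask'", 1, n_longest), 1)

* mask of l_longest (4) characters: the copy keeps its length
gen right = subinstr(estring, s_longest, substr("`mask'", 1, l_longest), 1)

* position of the second occurrence under each mask
gen where2_wrong = strpos(wrong, s_longest)
gen where2_right = strpos(right, s_longest)

list wrong right where2_wrong where2_right
```

With the short mask the second run is found at position 7; with the full-length mask it is found at position 9, which is where it sits in the original string.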

clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
end

gen X = ""
gen l_longest = 0
gen s_longest = ""
gen where1 = 0

qui forval i = 1/12 {
	replace X = X + "X"
	replace s_longest = X if strpos(estring, X)
	replace l_longest = `i' if strpos(estring, X)
	replace where1 = strpos(estring, X) if strpos(estring, X)
}

drop X
gen n_longest = ///
(length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
length(s_longest)

clonevar copy = estring
local mask : di _dup(12) "&"
local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
replace copy = `rtext'

su n_longest, meanonly
forval j = 2/`r(max)' {
	gen where`j' = strpos(copy, s_longest)
	replace copy = `rtext'
} 	
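
The n_longest calculation above relies on a length comparison: remove every copy of the longest run, see how many characters disappeared, and divide by the length of the run. A small standalone sketch of just that step follows. Here s_longest is typed in by hand rather than found by the loop, and the variable removed is a hypothetical intermediate added only to make the arithmetic visible.

```stata
* Sketch only: s_longest is entered by hand for three of the example strings
clear
input str12 estring str12 s_longest
"XXXX-XX-XXXX" "XXXX"
"-XXX-XXX-XXX" "XXX"
"XXXXX-------" "XXXXX"
end

* what is left after deleting every occurrence of the longest run
gen removed = subinstr(estring, s_longest, "", .)

* characters deleted, divided by the length of one run
gen n_longest = (length(estring) - length(removed)) / length(s_longest)

list estring s_longest removed n_longest
```

For these three strings n_longest comes out as 2, 3 and 1 respectively.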



On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <picard@netbox.com> wrote:
> Nick, I think that there's a problem with your code: it does not work
> with a string like:
>
>  "XXXX-XX-XXXX"
>
> Here's how I would do it:
>
> * -------------------------- begin example ----------------
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> 5          XXXX-XX-XXXX
> 6          X-XX-XX-XXXX
> 7          X-XXXXX-XXXX
> end
>
> * Find the longest period of continuous eligibility
> clonevar es = estring
> gen maxspan = ""
> local more 1
> while `more' {
>        gen s = regexs(1) if regexm(es,"(X+)")
>        replace maxspan = s if length(s) > length(maxspan)
>        replace es = subinstr(es,s,"",1)
>        count if s != ""
>        local more = r(N)
>        drop s
> }
>
>
> * Identify the start of each span
> gen smask = subinstr(maxspan,"X","_",.)
> replace es = estring
> local i 0
> local more 1
> while `more' {
>        local i = `i' + 1
>        gen where`i' = strpos(es,maxspan)
>        replace where`i' = . if where`i' == 0
>        replace es = subinstr(es,maxspan,smask,1)
>        count if where`i' != .
>        local more = r(N)
> }
> drop where`i'
> egen nmaxspan = rownonmiss(where*)
> drop es smask
>
> * -------------------------- end example ------------------
>
>
>
> On Sat, Mar 12, 2011 at 11:23 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> First, let me give a more complete example of how I would approach
>> your problem.
>>
>> 1. Your example data.
>>
>> clear
>> input patid str12 estring
>> 1          XXXXX-------
>> 2          --XXX---XXXX
>> 3          -XXXXXX-----
>> 4          -XXX-XXX-XXX
>> end
>>
>> 2. Sample script starts with initialisations. Clearly, 12 is specific
>> to the example.
>>
>> gen X = ""
>> gen l_longest = 0
>> gen s_longest = ""
>> gen where1 = 0
>>
>> 3. The main loop just tries out longer multiples of "X" until it finds
>> the longest.
>>
>> qui forval i = 1/12 {
>>       replace X = X + "X"
>>         replace s_longest = X if strpos(estring, X)
>>       replace l_longest = `i' if strpos(estring, X)
>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>> }
>>
>> drop X
>>
>> 4. The number of times the longest substring occurs is calculated from
>> a comparison of length before and after (notionally) blanking it out.
>> There is more on this trick at Mitch Abdon's blog
>>
>> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>>
>> and in my Speaking Stata column in SJ 11(1) 2011.
>>
>> gen n_longest = ///
>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>> length(s_longest)
>>
>> 5. Now to find the separate occurrences of the longest substring we
>> look for each one in a copy, and every time we find one we
>> replace it with a mask of the same length. "&" is arbitrary here.
>>
>> clonevar copy = estring
>> local mask : di _dup(12) "&"
>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>> replace copy = `rtext'
>>
>> su n_longest, meanonly
>> forval j = 2/`r(max)' {
>>        gen where`j' = strpos(copy, s_longest)
>>        replace copy = `rtext'
>> }
>>
>> Part of the Stata magic is that what the longest substring is, how
>> many times it occurs, and its length can easily vary from observation
>> to observation.
>>
>> Here is all the code as one segment
>>
>> clear
>> input patid str12 estring
>> 1          XXXXX-------
>> 2          --XXX---XXXX
>> 3          -XXXXXX-----
>> 4          -XXX-XXX-XXX
>> end
>>
>> gen X = ""
>> gen l_longest = 0
>> gen s_longest = ""
>> gen where1 = 0
>>
>> qui forval i = 1/12 {
>>       replace X = X + "X"
>>         replace s_longest = X if strpos(estring, X)
>>       replace l_longest = `i' if strpos(estring, X)
>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>> }
>>
>> drop X
>> gen n_longest = ///
>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>> length(s_longest)
>>
>> clonevar copy = estring
>> local mask : di _dup(12) "&"
>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>> replace copy = `rtext'
>>
>> su n_longest, meanonly
>> forval j = 2/`r(max)' {
>>        gen where`j' = strpos(copy, s_longest)
>>        replace copy = `rtext'
>> }
>>
>> Now: commenting on -split-. The approach above seems closer to what
>> you want than using -split-.
>>
>> -split- treats multiple spaces as one, but otherwise does not treat
>> multiple occurrences of other delimiters as equivalent to one
>> occurrence. That is why I wrote
>>
>> replace fstring = subinstr(fstring, "-", " ", .)
>>
>> You will find that
>>
>> split estring, parse(-)
>>
>> creates rather too many variables to be useful.
>>
>> Nick
>>
>> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <rebecca.a.pope@gmail.com> wrote:
>>> Nick,
>>> I had to read what you wrote a couple of times before the "Duh" kicked
>>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>>> below. The real difference, which I didn't catch at first, is that you
>>> don't condense the spaces into a single space like I did. -split- will
>>> create a new variable for each " ", thereby preserving where the
>>> string started. For subsequent instances of variables including Xs,
>>> the index on the variable generated by -split- will be off, but I
>>> could just add the length of the preceding variables. Brilliant! (you,
>>> not me)
>>>
>>> In the interest of full disclosure, I'm rather ashamed to admit that I
>>> initially used -split- exactly as you do and cursed at it for not
>>> recognizing multiple delimiters as one, went back and condensed the
>>> multiple spaces to a single space, and then -split- the variable
>>> again. In fact, my initial reaction to your e-mail was "Did that;
>>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>>
>>> Unless I'm missing something else, I could just use a - split estring,
>>> parse(-) -, correct?
>>>
>>> Thanks again for all the help,
>>> Rebecca
>>>
>>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>
>>>> Have you thought of something like
>>>>
>>>> clonevar fstring = estring
>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>> split fstring
>>>>
>>> <truncated>