Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Rebecca Pope <[email protected]>
To	[email protected]
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Sun, 13 Mar 2011 10:40:07 -0500

On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:

> 3. My original code could be speeded up a bit by not using a variable
> X but my guess would be that Robert's is still definitely faster.
>
I should have specified that the time listed in my previous post came
from your code altered to use the macro you posted later. That one
change makes a substantial difference in speed: the run takes just
under half the time it takes with the variable. Even better, it means
that I don't need to drop the other variables in my dataset to complete
the search over all 6 million observations. If you count the time to
merge the findings back in, the difference is even greater.

On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
> Turns out that finding the longest span can be done faster without
> string manipulations. Here's a new version:
>
> * -------------------------- begin example ----------------
>
> clear all
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> 5          XXXX-XX-XXXX
> 6          X-XX-XX-XXXX
> 7          X-XXXXX-XXXX
> 8          X-XXX---XXX-
> 9          XXXXXXXXXXXX
> 10         ------------
> end
>
> local len = 12
>
> * Find the longest period of continuous eligibility.
> gen maxspan = 0
> gen currspan = 0
> gen isX = 0
> qui forvalue i = 1/`len' {
>        replace isX = substr(estring,`i',1) == "X"
>        replace currspan = currspan + 1 if isX
>        replace maxspan = currspan if !isX & ///
>                currspan > maxspan
>        replace currspan = 0 if !isX
> }
> replace maxspan = currspan if currspan > maxspan
> drop currspan isX
>
> * Identify the start of each span
> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
> gen blanks = subinstr(spanX,"X"," ",.)
> gen es = estring
> local i 0
> local more 1
> qui while `more' {
>        local i = `i' + 1
>        gen where`i' = strpos(es,spanX)
>        replace where`i' = . if where`i' == 0
>        replace es = subinstr(es,spanX,blanks,1)
>        count if where`i' != .
>        local more = r(N)
> }
> drop where`i'
> replace where1 = . if maxspan == 0
> egen nmaxspan = rownonmiss(where*)
> drop es blanks spanX
>
> * -------------------------- end example ------------------

Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
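
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of timing a block with Stata's -timer- command. The dataset and the -generate- line are placeholders I made up for illustration; they stand in for whichever version of the code is being timed and are not the code measured above.

```stata
* Placeholder data: 1,000 copies of one eligibility string (made up).
clear
set obs 1000
gen str12 estring = "XXXX-XX-XXXX"

* Time one block of code with timer number 1.
timer clear 1
timer on 1
gen byte firstX = substr(estring, 1, 1) == "X"   // stand-in for the code under test
timer off 1

* -timer list- leaves the elapsed seconds in r(t1).
quietly timer list 1
display "elapsed seconds: " r(t1)
```

Each variant can be given its own timer number (1 to 100) so that the results can be listed side by side.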

When I split the code into two steps, (1) finding the longest span of
continuous eligibility and (2) determining where those spans occur
within the 15-year period covered by the data, I get the best
performance by using Robert's method for (1) and Nick's method for
(2). The whole process takes just under 29 seconds.
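
Step (1) can be packaged as a small program so that it can be called on its own. The following is only a sketch under some assumptions: the program name -longestrun- and its three positional arguments are hypothetical, and it is a slight variation on Robert's loop rather than his exact code, since it updates the maximum on every pass instead of only when a run ends.

```stata
* Hypothetical helper (name and arguments are illustrative only).
* Usage: longestrun strvar newvar len
capture program drop longestrun
program define longestrun
    version 11
    args strvar newvar len
    confirm string variable `strvar'
    confirm new variable `newvar'
    confirm integer number `len'

    tempvar curr isx
    quietly {
        gen int `newvar' = 0      // longest run of "X" seen so far
        gen int `curr'   = 0      // length of the run ending at position i
        gen byte `isx'   = 0
        forvalues i = 1/`len' {
            replace `isx'    = substr(`strvar', `i', 1) == "X"
            replace `curr'   = cond(`isx', `curr' + 1, 0)
            replace `newvar' = max(`newvar', `curr')
        }
    }
end
```

With the example data quoted above, the call would be -longestrun estring maxspan 12-. Because the maximum is refreshed inside the loop, no catch-up -replace- is needed after the last position.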

Thanks again very much to both of you. I'd still be muddling through
with trial and error without you. I've also learned a lot by looking
at your code. I really appreciate all the help.

Best,
Rebecca
