
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Rebecca Pope <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sun, 13 Mar 2011 10:40:07 -0500

On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:

> 3. My original code could be speeded up a bit by not using a variable
> X but my guess would be that Robert's is still definitely faster.
>
I should have specified that the time listed in my previous post came
from your code altered to use the macro you posted later. That one
change makes a substantial difference in speed: the run takes just
under half the time it takes with the variable. Even better, it means
that I don't need to drop the other variables in my dataset to complete
the search over all 6 million observations. If you count the time to
merge the findings back in, the difference is even greater.
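
To illustrate the variable-versus-macro difference, here is a minimal sketch. It is not Nick's code, which is not reproduced in this message. The names X, hit_var and hit_mac and the pattern length of 4 are assumptions made for the example. The variable version stores one copy of the pattern in every observation, while the macro version keeps a single copy outside the dataset.

```stata
clear
input str12 estring
XXXXX-------
--XXX---XXXX
-XXX-XXX-XXX
end

* Variable version: the pattern is stored once per observation.
gen str12 X = "XXXXXXXXXXXX"
gen byte hit_var = strpos(estring, substr(X, 1, 4)) > 0

* Macro version: the pattern lives in a local macro, not in the data.
* (X the variable and X the local macro are separate objects.)
local X : display _dup(12) "X"
gen byte hit_mac = strpos(estring, substr("`X'", 1, 4)) > 0

* Both versions should flag the same observations.
assert hit_var == hit_mac
```

With millions of observations, dropping the variable saves both the memory it occupies and the work of reading it in each comparison.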

On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
> Turns out that finding the longest span can be done faster without
> string manipulations. Here's a new version:
>
> * -------------------------- begin example ----------------
>
> clear all
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> 5          XXXX-XX-XXXX
> 6          X-XX-XX-XXXX
> 7          X-XXXXX-XXXX
> 8          X-XXX---XXX-
> 9          XXXXXXXXXXXX
> 10         ------------
> end
>
> local len = 12
>
> * Find the longest period of continuous eligibility.
> gen maxspan = 0
> gen currspan = 0
> gen isX = 0
> qui forvalue i = 1/`len' {
>        replace isX = substr(estring,`i',1) == "X"
>        replace currspan = currspan + 1 if isX
>        replace maxspan = currspan if !isX & ///
>                currspan > maxspan
>        replace currspan = 0 if !isX
> }
> replace maxspan = currspan if currspan > maxspan
> drop currspan isX
>
> * Identify the start of each span
> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
> gen blanks = subinstr(spanX,"X"," ",.)
> gen es = estring
> local i 0
> local more 1
> qui while `more' {
>        local i = `i' + 1
>        gen where`i' = strpos(es,spanX)
>        replace where`i' = . if where`i' == 0
>        replace es = subinstr(es,spanX,blanks,1)
>        count if where`i' != .
>        local more = r(N)
> }
> drop where`i'
> replace where1 = . if maxspan == 0
> egen nmaxspan = rownonmiss(where*)
> drop es blanks spanX
>
> * -------------------------- end example ------------------

Yup. It reduces total run time by about 3.5 seconds in the 10% sample.

Splitting the code into two steps, (1) finding the longest span of
continuous eligibility and (2) determining where those spans occur
within the 15-year period covered by the data, I get the best
performance by using Robert's method for (1) and Nick's method for
(2). The whole process takes just under 29 seconds.
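
For anyone who wants to time the two parts separately, a minimal sketch using Stata's -timer- commands follows. The loop is a condensed variant of Robert's approach, written with -cond()- and -max()-, and the second part only finds the first start position with -strpos()-. It is not Nick's method, which is not shown in this message. The timer numbers and the names allX and first are arbitrary choices for the example.

```stata
clear all
input patid str12 estring
1  XXXXX-------
2  --XXX---XXXX
3  -XXX-XXX-XXX
4  ------------
end

timer clear

* Part (1): length of the longest run of X in each string.
timer on 1
gen maxspan = 0
gen currspan = 0
quietly forvalues i = 1/12 {
    * Extend the current run on an X, otherwise reset it to zero.
    replace currspan = cond(substr(estring, `i', 1) == "X", currspan + 1, 0)
    replace maxspan = max(maxspan, currspan)
}
drop currspan
timer off 1

* Part (2): first position at which a run of that length starts.
timer on 2
local allX : display _dup(12) "X"
gen first = strpos(estring, substr("`allX'", 1, maxspan)) if maxspan > 0
timer off 2

* Elapsed seconds for each part (also left in r(t1) and r(t2)).
timer list
```

On real data the two timers show which part dominates, which is how a mix of methods can be compared.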

Thanks again very much to both of you. I'd still be muddling through
with trial and error without you. I've also learned a lot by looking
at your code. I really appreciate all the help.

Best,
Rebecca


