Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
From: Rebecca Pope <[email protected]>
To: [email protected]
Subject: Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date: Sun, 13 Mar 2011 14:45:39 -0500

A quick question about processing speed for this routine: should it
slow down considerably with temporary variables? Because it is my
habit to use temporary variables when I do not intend to keep them,
I changed Robert's code to use -tempvar- instead of creating the
"isX" and "currspan" variables and then dropping them. The processing
time increased from 21 to 72 seconds. Note: "maxspan" is renamed
"contelig" in my code to be consistent with the rest of my program.
*** Robert's Original Code ***
timer on 5
gen maxspan = 0
gen currspan = 0
gen isX = 0
qui forvalue i = 1/`len' {
    replace isX = substr(estring,`i',1) == "X"
    replace currspan = currspan + 1 if isX
    replace maxspan = currspan if !isX & ///
        currspan > maxspan
    replace currspan = 0 if !isX
}
replace maxspan = currspan if currspan > maxspan
drop currspan isX
timer off 5
*** My modified code ***
gen int contelig = 0
label var contelig "Longest Period of Continuous Enrollment"
note contelig: Number of months in longest set of Xs from 'estring'
tempvar isX currelig n_longest
timer on 1
gen int `currelig' = 0
gen byte `isX' = 0
qui forvalues i = 1/`len' {
    replace `isX' = substr(estring,`i',1) == "X"
    replace `currelig' = `currelig' + 1 if `isX'
    replace contelig = `currelig' if !`isX' & ///
        `currelig' > contelig
    replace `currelig' = 0 if !`isX'
}
replace contelig = `currelig' if `currelig' > contelig
timer off 1
*-------end of code snippets
timer list
1: 71.94 / 1 = 71.9390
5: 21.01 / 1 = 21.0060
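
One way to check whether -tempvar- itself accounts for the difference
is to time both versions on the same data with identical storage
types. The two snippets above differ in more than the temporary names
(-gen int- and -gen byte- versus the default type), so the sketch
below holds the types fixed. It assumes that estring and the local
macro len are already defined; the timer numbers 11 and 12 and the
names max_perm, cs_perm, isX_perm and max_temp are placeholders, not
part of either program.

* -------------------------- begin example ----------------
* Sketch only: same algorithm, same storage types; the two runs
* differ only in permanent versus temporary variable names.
timer clear

* Version A: permanent working variables, dropped afterwards
timer on 11
gen int max_perm = 0
gen int cs_perm = 0
gen byte isX_perm = 0
qui forvalues i = 1/`len' {
    replace isX_perm = substr(estring,`i',1) == "X"
    replace cs_perm = cs_perm + 1 if isX_perm
    replace max_perm = cs_perm if !isX_perm & cs_perm > max_perm
    replace cs_perm = 0 if !isX_perm
}
replace max_perm = cs_perm if cs_perm > max_perm
drop cs_perm isX_perm
timer off 11

* Version B: -tempvar- working variables
timer on 12
tempvar cs isX
gen int max_temp = 0
gen int `cs' = 0
gen byte `isX' = 0
qui forvalues i = 1/`len' {
    replace `isX' = substr(estring,`i',1) == "X"
    replace `cs' = `cs' + 1 if `isX'
    replace max_temp = `cs' if !`isX' & `cs' > max_temp
    replace `cs' = 0 if !`isX'
}
replace max_temp = `cs' if `cs' > max_temp
timer off 12

* Both versions should give the same answer before timings are compared
assert max_perm == max_temp
timer list
* -------------------------- end example ------------------

If the gap persists with the types held fixed, the temporary names
are the remaining difference between the two runs.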
Best,
Rebecca
On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>
>> 3. My original code could be speeded up a bit by not using a variable
>> X but my guess would be that Robert's is still definitely faster.
>>
> I should have specified that I altered your code to use the macro you
> posted later for the time listed in my previous post. That one change
> makes a substantial difference in the speed--just less than half the
> time it takes to run with the variable. Even better, it means that I
> don't need to drop the other variables in my dataset to complete the
> search over all 6 million observations. If you count time to merge the
> findings back in the difference is even greater.
>
> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>> Turns out that finding the longest span can be done faster without
>> string manipulations. Here's a new version:
>>
>> * -------------------------- begin example ----------------
>>
>> clear all
>> input patid str12 estring
>> 1 XXXXX-------
>> 2 --XXX---XXXX
>> 3 -XXXXXX-----
>> 4 -XXX-XXX-XXX
>> 5 XXXX-XX-XXXX
>> 6 X-XX-XX-XXXX
>> 7 X-XXXXX-XXXX
>> 8 X-XXX---XXX-
>> 9 XXXXXXXXXXXX
>> 10 ------------
>> end
>>
>> local len = 12
>>
>> * Find the longest period of continuous eligibility.
>> gen maxspan = 0
>> gen currspan = 0
>> gen isX = 0
>> qui forvalue i = 1/`len' {
>> replace isX = substr(estring,`i',1) == "X"
>> replace currspan = currspan + 1 if isX
>> replace maxspan = currspan if !isX & ///
>> currspan > maxspan
>> replace currspan = 0 if !isX
>> }
>> replace maxspan = currspan if currspan > maxspan
>> drop currspan isX
>>
>> * Identify the start of each span
>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>> gen blanks = subinstr(spanX,"X"," ",.)
>> gen es = estring
>> local i 0
>> local more 1
>> qui while `more' {
>> local i = `i' + 1
>> gen where`i' = strpos(es,spanX)
>> replace where`i' = . if where`i' == 0
>> replace es = subinstr(es,spanX,blanks,1)
>> count if where`i' != .
>> local more = r(N)
>> }
>> drop where`i'
>> replace where1 = . if maxspan == 0
>> egen nmaxspan = rownonmiss(where*)
>> drop es blanks spanX
>>
>> * -------------------------- end example ------------------
>
> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>
> Splitting the code into two functions, (1) finding the longest span of
> continuous eligibility and (2) determining where those spans occur
> within the 15-year period covered by the data, I get the best
> performance by using Robert's method for (1) and Nick's method for
> (2). The whole process takes just less than 29 seconds.
>
> Thanks again very much to both of you. I'd still be muddling through
> with trial and error without you. I've also learned a lot by looking
> at your code. I really appreciate all the help.
>
> Best,
> Rebecca
>
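
As a quick correctness check on step (1), the -tempvar- version can
be run on Robert's ten-observation example and compared against span
lengths counted by hand. The column expected is an addition for
illustration (hand-counted from the strings in the example), not part
of the original data.

* -------------------------- begin example ----------------
* Sketch only: "expected" holds the longest run of Xs in each
* string, counted by hand; it is not in Robert's example.
clear all
input patid str12 estring expected
1 XXXXX------- 5
2 --XXX---XXXX 4
3 -XXXXXX----- 6
4 -XXX-XXX-XXX 3
5 XXXX-XX-XXXX 4
6 X-XX-XX-XXXX 4
7 X-XXXXX-XXXX 5
8 X-XXX---XXX- 3
9 XXXXXXXXXXXX 12
10 ------------ 0
end

local len = 12

* Longest span of Xs, using temporary working variables
tempvar isX currelig
gen int contelig = 0
gen int `currelig' = 0
gen byte `isX' = 0
qui forvalues i = 1/`len' {
    replace `isX' = substr(estring,`i',1) == "X"
    replace `currelig' = `currelig' + 1 if `isX'
    replace contelig = `currelig' if !`isX' & `currelig' > contelig
    replace `currelig' = 0 if !`isX'
}
* A span that runs to the end of the string is still open here
replace contelig = `currelig' if `currelig' > contelig

* Stops with an error if any observation disagrees
assert contelig == expected
* -------------------------- end example ------------------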