Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Robert Picard <[email protected]>
To	[email protected]
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Mon, 14 Mar 2011 09:06:25 +0100

I'm traveling, so I don't have time to look into this right now, but I
suspect that the timing differences are due to the use of a more
compact data type, in particular for your temporary `isX'. On 32- and
64-bit computers, fetching a byte requires more work than fetching a
real or a long.
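
One way to check this suspicion would be to time the same loop with
different storage types for the flag, with and without -tempvar-. The
following is only a sketch under stated assumptions, not tested code
from this thread: the constant estring, the 1,000,000 observations, and
the timer numbers 1 to 4 are all made up for illustration.

```stata
* Sketch only: the test data, observation count, and timer slots are assumptions.
clear all
timer clear
set obs 1000000
gen str12 estring = "XXXX-XX-XXXX"    // made-up data, same pattern in every row
local len = 12

local k = 0
foreach type in byte float {
    foreach kind in named temp {
        local ++k
        * Use either a temporary name or the permanent name isX for the flag
        if "`kind'" == "temp" {
            tempvar flag
        }
        else {
            local flag isX
        }
        gen `type' `flag' = 0
        timer on `k'
        quietly forvalues i = 1/`len' {
            replace `flag' = substr(estring, `i', 1) == "X"
        }
        timer off `k'
        drop `flag'
        display "timer `k': `type', `kind'"
    }
}
timer list
```

Timers 1 and 2 cover byte (named, then temporary) and timers 3 and 4
cover float, so comparing them separates the effect of the storage type
from the effect of -tempvar-.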

On Mon, Mar 14, 2011 at 2:30 AM, Rebecca Pope <[email protected]> wrote:
> I should add that since I posted the question about -tempvar- earlier,
> I've stepped through each piece of the code, isolating the changes.
>
> Here is an inventory of all the differences in code that I could find:
> 1. I use a different variable name for the permanent variable
> ("contelig" instead of "maxspan")
> 2. "contelig" had a variable label and notes attached to it while
> "maxspan" did not
> 3. I define the variable type when I create them while Robert's code
> uses the system default
> 4. I use temporary variables
>
> I closed down Stata, reopened it, and copied Robert's original code
> into a new do-file editor. I changed each piece at a time, starting by
> using find & replace to change "maxspan" to "contelig" (paranoid, I
> know). Then I ran the code w/ timer again. No change in time... I went
> down the list above & lost a few tenths of a second when I added the
> label & notes. No change for (3). And then a big hit on (4). The
> difference was not quite as extreme as what I posted earlier, but
> still there.
>
> Thanks,
> Rebecca
>
> On Sun, Mar 13, 2011 at 8:15 PM, Rebecca Pope <[email protected]> wrote:
>> Sorry. I renamed Robert's "currspan" to "currelig" (using find &
>> replace) just to have the terminology consistent. When I copied my
>> code over here, I did another F&R so that it would be consistent with
>> Robert's just above, but I apparently didn't highlight far enough
>> down.
>>
>> Rebecca
>>
>> On Sun, Mar 13, 2011 at 7:40 PM, Nick Cox <[email protected]> wrote:
>>> You refer to a temporary variable -currelig-. Where do you define it?
>>>
>>> Nick
>>>
>>> On Sun, Mar 13, 2011 at 7:45 PM, Rebecca Pope <[email protected]> wrote:
>>>> A quick question about optimizing processing speed for this routine:
>>>> Should the speed slow considerably with temporary variables? Because
>>>> it is my habit to have temporary variables when I do not intend to
>>>> keep them, I changed Robert's code to use -tempvar- instead of
>>>> creating the "isX" and "currspan" variables and then dropping them.
>>>> The processing time increased from 21 to 72 seconds. Note: "maxspan"
>>>> renamed "contelig" in my code to be consistent with the rest of my
>>>> program.
>>>>
>>>> *** Robert's Original Code ***
>>>> timer on 5
>>>> gen maxspan = 0
>>>> gen currspan = 0
>>>> gen isX = 0
>>>> qui forvalue i = 1/`len' {
>>>>       replace isX = substr(estring,`i',1) == "X"
>>>>       replace currspan = currspan + 1 if isX
>>>>       replace maxspan = currspan if !isX & ///
>>>>               currspan > maxspan
>>>>       replace currspan = 0 if !isX
>>>> }
>>>> replace maxspan = currspan if currspan > maxspan
>>>> drop currspan isX
>>>> timer off 5
>>>>
>>>> *** My modified code ***
>>>> gen int contelig = 0
>>>> label var contelig "Longest Period of Continuous Enrollment"
>>>> note contelig: Number of months in longest set of Xs from 'estring'
>>>>
>>>> tempvar isX currelig n_longest
>>>> timer on 1
>>>> gen int `currspan' = 0
>>>> gen byte `isX' = 0
>>>>
>>>> qui forvalues i = 1/`len' {
>>>>       replace `isX' = substr(estring,`i',1) == "X"
>>>>       replace `currspan' = `currspan' + 1 if `isX'
>>>>       replace contelig = `currspan' if !`isX' & ///
>>>>               `currspan' > contelig
>>>>       replace `currspan' = 0 if !`isX'
>>>> }
>>>> replace contelig = `currelig' if `currelig' > contelig
>>>> timer off 1
>>>>
>>>> *-------end of code snippets
>>>>
>>>>  timer list
>>>>   1:     71.94 /        1 =      71.9390
>>>>   5:     21.01 /        1 =      21.0060
>>>>
>>>> Best,
>>>> Rebecca
>>>>
>>>> On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
>>>>> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>>>>>
>>>>>> 3. My original code could be speeded up a bit by not using a variable
>>>>>> X but my guess would be that Robert's is still definitely faster.
>>>>>>
>>>>> I should have specified that I altered your code to use the macro you
>>>>> posted later for the time listed in my previous post. That one change
>>>>> makes a substantial difference in the speed--just less than half the
>>>>> time it takes to run with the variable. Even better, it means that I
>>>>> don't need to drop the other variables in my dataset to complete the
>>>>> search over all 6 million observations. If you count time to merge the
>>>>> findings back in the difference is even greater.
>>>>>
>>>>> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>>>>>> Turns out that finding the longest span can be done faster without
>>>>>> string manipulations. Here's a new version:
>>>>>>
>>>>>> * -------------------------- begin example ----------------
>>>>>>
>>>>>> clear all
>>>>>> input patid str12 estring
>>>>>> 1          XXXXX-------
>>>>>> 2          --XXX---XXXX
>>>>>> 3          -XXXXXX-----
>>>>>> 4          -XXX-XXX-XXX
>>>>>> 5          XXXX-XX-XXXX
>>>>>> 6          X-XX-XX-XXXX
>>>>>> 7          X-XXXXX-XXXX
>>>>>> 8          X-XXX---XXX-
>>>>>> 9          XXXXXXXXXXXX
>>>>>> 10         ------------
>>>>>> end
>>>>>>
>>>>>> local len = 12
>>>>>>
>>>>>> * Find the longest period of continuous eligibility.
>>>>>> gen maxspan = 0
>>>>>> gen currspan = 0
>>>>>> gen isX = 0
>>>>>> qui forvalue i = 1/`len' {
>>>>>>        replace isX = substr(estring,`i',1) == "X"
>>>>>>        replace currspan = currspan + 1 if isX
>>>>>>        replace maxspan = currspan if !isX & ///
>>>>>>                currspan > maxspan
>>>>>>        replace currspan = 0 if !isX
>>>>>> }
>>>>>> replace maxspan = currspan if currspan > maxspan
>>>>>> drop currspan isX
>>>>>>
>>>>>> * Identify the start of each span
>>>>>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>>>>>> gen blanks = subinstr(spanX,"X"," ",.)
>>>>>> gen es = estring
>>>>>> local i 0
>>>>>> local more 1
>>>>>> qui while `more' {
>>>>>>        local i = `i' + 1
>>>>>>        gen where`i' = strpos(es,spanX)
>>>>>>        replace where`i' = . if where`i' == 0
>>>>>>        replace es = subinstr(es,spanX,blanks,1)
>>>>>>        count if where`i' != .
>>>>>>        local more = r(N)
>>>>>> }
>>>>>> drop where`i'
>>>>>> replace where1 = . if maxspan == 0
>>>>>> egen nmaxspan = rownonmiss(where*)
>>>>>> drop es blanks spanX
>>>>>>
>>>>>> * -------------------------- end example ------------------
>>>>>
>>>>> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>>>>>
>>>>> Splitting the code into two functions, (1) finding the longest span of
>>>>> continuous eligibility and (2) determining where those spans occur
>>>>> within the 15-year period covered by the data, I get the best
>>>>> performance by using Robert's method for (1) and Nick's method for
>>>>> (2). The whole process takes just less than 29 seconds.
>>>>>
>>>>> Thanks again very much to both of you. I'd still be muddling through
>>>>> with trial and error without you. I've also learned a lot by looking
>>>>> at your code. I really appreciate all the help.
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
>>>
>>
>
