Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Improving code speed

From	Nick Cox <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Improving code speed
Date	Thu, 23 May 2013 09:31:27 +0100

Thanks for confirmation of mistakes.

So, here is an extra comment as you are concerned about cleaning up
your code and speeding it up. I can see no reason at all for copying
variable -epsilon- to matrix -epsilon- using -mkmat-. You have a
variable -epsilon- that should be fine for your purpose. In fact that
was I suggested yesterday.

If you post revised, complete and correct code you may get further
comments, but no definite promises from me.
Nick
[email protected]


On 23 May 2013 09:01, Luis Aguiar <[email protected]> wrote:
> Hi George and Nick,
>
> Thanks a lot for your responses.
>
> Nick: Yes, the code works but there are indeed a couple of mistakes in
> the one I copied. I actually use the command " mkmat epsilon if
> id<=`reps' " after generating the variable epsilon (id is indeed _n in
> a variable). The second line in your comment (b) should read " replace
> x`i'=z + epsilon[`i',1] if id==`k'  " . Sorry about all that. Thanks
> for your helpful comments though, I will try to incorporate them into
> my code along with George's comments in order to speed things up.
>
> George: Thanks for your suggestions as well. I wasn't sure if it would
> be worth going into mata, but I will try it now. Your parallel code
> seems very interesting too. Do you think it would go faster than using
> mata?
>
> Again, thanks a lot to both of you!
>
> Cheers,
> Luis
>
> 2013/5/22 Nick Cox <[email protected]>:
>> Please use your full real name. See Statalist FAQ for that request and why.
>>
>> Some speed-ups are likely to be possible here, but I first I note
>> several puzzles with this code.
>>
>> You don't say so, but presumably -id- is _n in a variable.
>>
>> -z- is unexplained.
>>
>> More problematically,
>>
>> (a) You -generate- a variable -epsilon- but you refer to a matrix -epsilon-.
>>
>> (b) The two lines below won't work as once -x`i'- exists the second
>> command will fail.
>>
>>                 gen x`i'=z \\ generate simulated variable
>>                 gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part
>>
>> Why do you say that it works? Did you copy some buggy version by accident?
>>
>> Better to post self-contained code that works.
>>
>> Speed-ups, apart from using Mata. (See George Vega Yon's post.)
>>
>> 1. Too much copying from one variable to another. I could be wrong,
>> but some variables appear to be mostly zero, and you are just copying
>> constants. Think in terms of scalars instead.
>>
>> 2. Use -summarize, meanonly- to get sums. -egen- is very slow at this.
>>
>> 3. Use -in 1/`k'- or -in `k'- wherever possible. Whenever there is a
>> choice between -if- and -in- for the same problem, -in- is faster.
>>
>> Some example code:
>>
>>   gen x`i'=z
>>   replace x`i'= x`i' + epsilon[`i'] in `k'
>>
>>   su x`i' in 1/`k', meanonly
>>   replace Y`k' = (x`i')^2/r(sum) in `k'
>>
>>
>> Nick
>> [email protected]
>>
>>
>> On 22 May 2013 18:25, Luis <[email protected]> wrote:
>>> Dear statalist  users,
>>>
>>> I am running into a "loop efficiency problem" in that I have to
>>> construct a variable using many iterations and I am not sure whether I
>>> am being as efficient as possible. Given the number of observations
>>> that I have and with my current code, I have to wait days for my code
>>> to finish running! Here's my problem:
>>>
>>> I have a total of 50000 observations and need to construct a variable
>>> Y that will be computed using different subsamples of these
>>> observations. In particular,
>>> Y=Y1 when the subsample contains only the first observation,
>>> Y=Y2 when the subsample contains observations 1 and 2,
>>> Y=Y3 when the subsample contains observations 1, 2 and 3 etc until
>>> Y=Y50000.
>>>
>>> The idea is therefore to loop over the sample and define the subsample
>>> which contains observations 1 until k and construct the variable
>>> Y`k'=Yk if id==k and Y`k'=0 if id!=k. Then sum the variables Y`k'
>>> after each loop to end up with the final variable Y.
>>>
>>> To further complicate things, the variable Y needs to be the average
>>> of 100 simulations that depend on draws taken from a normal
>>> distribution. Hence I need to do a loop within the initial loop in
>>> order to do the 100 simulations.
>>>
>>> My code therefore looks like this:
>>>
>>> _____________________________________________________________________________________
>>>
>>> gen Y=0
>>>
>>> local reps=100 \\ define the number of simulations
>>>
>>> gen epsilon=rnormal() \\ generate the random var for the simulations
>>>
>>> forvalues k=1(1)50000{
>>>
>>> gen subs=(id<=`k')   \\ Define the subsample to be used
>>> gen Y`k'=0      \\ gen the intermediate Y`k'
>>>
>>>         forvalues i=1(1)`reps'{
>>>
>>>                 gen x`i'=z \\ generate simulated variable
>>>                 gen x`i'=z + epsilon[`i',1] if id==`k' \\ Add the random part
>>>
>>>         gen t=(x`i')^2
>>>         bysort subs: egen tsum=sum(x`i')
>>>
>>>         gen Y_`i'=t/tsum if id ==`k' \\ Construct Y for simulation i
>>>         replace Y_`i'=0 if id!=`k'
>>>
>>>         replace Y`k'=Y`k' + Y_`i'
>>>                 replace Y`k'=0 if id!=`k'
>>>
>>>         drop Y_`i' t tsum x`i'
>>>         }
>>>
>>> replace Y`k'=Y`k'/`reps'      // average Y from the 100 simulations
>>> replace Y= Y + Y`k'
>>> drop Y`k' subs
>>>         }
>>>
>>> ____________________________________________________________________________________
>>>
>>>
>>> The code runs fine, but I takes a lot of time since it has to
>>> construct 100 variables for each of the 50000 iterations. I have tried
>>> many different possibilities and I can't think of another way of
>>> constructing Y.
>>>
>>> Any tip or suggestion that would help improve the efficiency of my
>>> code would be greatly appreciated!!!
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- Re: st: Improving code speed
  - From: George Vega Yon <[email protected]>

References:
- st: Improving code speed
  - From: Luis <[email protected]>
- Re: st: Improving code speed
  - From: Nick Cox <[email protected]>
- Re: st: Improving code speed
  - From: Luis Aguiar <[email protected]>

Prev by Date: Re: st: Improving code speed
Next by Date: Re: st: Re: svmat is changing numbers - a rounding problem?
Previous by thread: Re: st: Improving code speed
Next by thread: Re: st: Improving code speed
Index(es):
- Date
- Thread