Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: More efficient processing of nested loops?

From	Robson Glasscock <[email protected]>
To	[email protected]
Subject	Re: st: More efficient processing of nested loops?
Date	Wed, 15 Aug 2012 15:18:56 -0400

I completely understand that, and I'm also very appreciative of the
time you spent with my problem.

best,
r
On Wed, Aug 15, 2012 at 3:02 PM, Nick Cox <[email protected]> wrote:
> Sorry, Robson, but I spent a while working on my previous post and
> will add no more time to this thread.
> Someone else may naturally be able to help.
>
> On Wed, Aug 15, 2012 at 7:55 PM, Robson Glasscock <[email protected]> wrote:
>> Nick,
>> Thank you for your help with this. Your approach gives me the return
>> for a single day. There is some variation, but that day is usually the
>> return on the date the two datasets were merged. These observations
>> have non-missing values for datadate_2 and date_t_4. I had a difficult
>> time writing this up, and I apologize if my description of the merged
>> datasets or the problem was not clear enough.
>>
>> Dataset A has around 12 years of daily returns for 3,500 firms.
>> Dataset B has the earnings announcement date for quarter t
>> (datadate_2) and quarter t-4 (date_t_4) in each row. I merged these
>> two datasets (1:1 cusip date) into Dataset C and ran the two loops.
>> There are typically around 250 days between datadate_2 and date_t_4
>> that need to be added together.
>>
>> I can provide any additional information if needed. Here is the code I
>> ran from your responses:
>>
>> bysort cum_id_cum (datadate_2) : gen d2_temp_2 = datadate_2[1]
>> bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1]
>> egen cusip_mode = mode(cusip), by(cum_id_cum)
>> local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip"
>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>> replace cum_ab = cum_ab_temp if `mark'
>>
>> I wrote the two inefficient loops based on what I would do by hand to
>> sum up the abnormal returns. Go into the first cusip and identify the
>> two boundary dates for the first earnings announcement. Go back
>> through all the daily returns for that cusip and mark each day that is
>> between these two dates. Sum up these individual returns into one
>> number. Do this again for the first cusip but second earnings
>> announcement, etc.
>>
>> Thanks again for the time you have already spent on this,
>> Robson
>>
>>
>> On Wed, Aug 15, 2012 at 1:48 PM, Nick Cox <[email protected]> wrote:
>>> More speed-ups:
>>>
>>> bysort cum_id_cum (datadate2) : gen d2_temp_2 = datadate_2[1]
>>> bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1]
>>>
>>> On Wed, Aug 15, 2012 at 6:07 PM, Nick Cox <[email protected]> wrote:
>>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>>>> replace cum_ab = r(sum) if `mark'
>>>>
>>>> should be
>>>>
>>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>>>> replace cum_ab = cum_ab_temp if `mark'
>>>>
>>>> On Wed, Aug 15, 2012 at 5:59 PM, Nick Cox <[email protected]> wrote:
>>>>> The good news is that I think you are right. This code appears to be
>>>>> much more complicated than it needs to be.
>>>>>
>>>>> I can't follow your word description -- doesn't mean it's unclear,
>>>>> just means that it is too much for me to absorb -- but from looking at
>>>>> your code there are several major and minor inefficiencies.
>>>>>
>>>>> The main problems are
>>>>>
>>>>> 1. You have two outer loops tangled together, one Stata's and one
>>>>> home-made, but neither appears necessary.
>>>>>
>>>>> 2. The inner loop is a loop over one case, and so not needed.
>>>>>
>>>>> 3. -egen- calls are very inefficient to calculate constants that
>>>>> -summarize- can calculate, except that #1 implies to me that you can
>>>>> do most of the work in a few -egen- calls.
>>>>>
>>>>> 4. Some copying of values from one place to another to no obvious purpose.
>>>>>
>>>>> With some guesswork, your code to me boils down to
>>>>>
>>>>> egen d2_temp_2 = min(datadate_2), by(cum_id_cum)
>>>>> egen dt_temp_2 = min(date_t_4), by(cum_id_cum)
>>>>> egen cusip_mode = mode(cusip), by(cum_id_cum)
>>>>> local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip"
>>>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>>>>> replace cum_ab = r(sum) if `mark'
>>>>>
>>>>> Note that your use of "temporary variables" is not the same as Stata's.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Wed, Aug 15, 2012 at 4:26 PM, Robson Glasscock <[email protected]> wrote:
>>>>>
>>>>>> I am running Stata 12. I have written code that creates a variable
>>>>>> that is the sum of abnormal returns for each firm. The abnormal
>>>>>> returns are accumulated from three days after the firm makes its
>>>>>> earnings announcement in quarter t-4 to three days after the firm
>>>>>> makes its earnings announcement in quarter t. My problem is that it
>>>>>> takes around 2 minutes for my code to execute for each firm/quarter
>>>>>> announcement, and there are around 150,000 earnings announcements in
>>>>>> the dataset.
>>>>>>
>>>>>> I modified a panel dataset with the earnings announcement dates for
>>>>>> each firm so that each observation contains both the quarter t
>>>>>> earnings announcement date and the quarter t-4 earnings announcement
>>>>>> date (with the three days added to each per above). I then merged this
>>>>>> dataset with a second panel dataset that contains the daily returns
>>>>>> for each firm. The merged dataset has around 10.2 million
>>>>>> observations.
>>>>>>
>>>>>> Next, I created a count variable, cum_id_cum, which is a running total
>>>>>> of each earnings announcement. This variable is truly cumulative and
>>>>>> does not reset back to 1 when the next firm releases its first
>>>>>> earnings announcement. The loop contains a variable, runn, that starts
>>>>>> with a value equal to "1" and increases by 1 each time the loop is
>>>>>> processed. I'm using that to help identify the particular dates of the
>>>>>> quarter t and quarter t-4 earnings announcement so that the abnormal
>>>>>> returns are accumulated over the correct period. Datadate_2 is the
>>>>>> quarter t earnings announcement and  date_t_4 is the quarter t-4
>>>>>> earnings announcement. Cusip is the identifier for each firm. Date is
>>>>>> the date of the firm's return in the stock market for each trading
>>>>>> day.
>>>>>>
>>>>>> The big picture of my approach was to create temporary variables that
>>>>>> will equal the cusip, datadate_2, and date_t_4 when runn equals
>>>>>> cum_id_cum. These first-step temporary variables (d2_temp, dt_temp,
>>>>>> and cusip_temp) have missing values except in the observation where
>>>>>> runn equals cum_id_cum so I created second-step temporary variables
>>>>>> (cusip_temp_2, d2_temp_2, and dt_temp_2) which place the cusip,
>>>>>> datadate_2, and date_t_4 values for each particular run through the
>>>>>> entire dataset. This allows me to mark the days for each firm that are
>>>>>> between the dates of the earnings announcements and then sum up the
>>>>>> abnormal returns in a temporary variable called cum_ab_temp. The final
>>>>>> variable with the sum of the abnormal returns for each firm is cum_ab
>>>>>> and is retained in the observation where _merge==3 (from the merge
>>>>>> mentioned in the second paragraph of this post).
>>>>>>
>>>>>> My code is below. Note that I constrain it to the first 25,000
>>>>>> cum_id_cum values due to macro size constraints:
>>>>>>
>>>>>> gen runn= 1
>>>>>> levelsof cum_id_cum if cum_id_cum !=. & cum_id_cum <25000, local(cum_id_cum)
>>>>>> foreach 1 of local cum_id_cum{
>>>>>> gen d2_temp= datadate_2 if runn== cum_id_cum
>>>>>> gen dt_temp= date_t_4 if runn== cum_id_cum
>>>>>> gen cusip_temp= cusip if runn== cum_id_cum
>>>>>> egen cusip_temp_2= mode(cusip_temp)
>>>>>> egen d2_temp_2= min(d2_temp)
>>>>>> egen dt_temp_2= min(dt_temp)
>>>>>> foreach x of varlist date{
>>>>>> replace mark= 1 if `x' <= d2_temp_2 & `x' >= dt_temp_2 & cusip_temp_2==cusip
>>>>>> egen cum_ab_temp= total(ab_ret) if mark==1
>>>>>> replace cum_ab= cum_ab_temp if datadate_2== d2_temp & dt_temp==
>>>>>> date_t_4 & cusip_temp==cusip & d2_temp !=. & dt_temp !=.
>>>>>> replace runn= runn+1
>>>>>> drop d2_temp
>>>>>> drop dt_temp
>>>>>> drop d2_temp_2
>>>>>> drop dt_temp_2
>>>>>> drop cusip_temp
>>>>>> drop cusip_temp_2
>>>>>> drop cum_ab_temp
>>>>>> replace mark=0
>>>>>>
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> I'm wondering if there is a more efficient way to do the above which
>>>>>> will result in a significantly faster processing time. My fear is that
>>>>>> the above ignores the functionality of Stata and instead uses
>>>>>> inefficient brute force.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: More efficient processing of nested loops?
  - From: Robson Glasscock <[email protected]>
- Re: st: More efficient processing of nested loops?
  - From: Nick Cox <[email protected]>
- Re: st: More efficient processing of nested loops?
  - From: Nick Cox <[email protected]>
- Re: st: More efficient processing of nested loops?
  - From: Nick Cox <[email protected]>
- Re: st: More efficient processing of nested loops?
  - From: Robson Glasscock <[email protected]>
- Re: st: More efficient processing of nested loops?
  - From: Nick Cox <[email protected]>

Prev by Date: RE: st: display graphs on log scale
Next by Date: st: New version of -scheme_rbn1mono- on SSC
Previous by thread: Re: st: More efficient processing of nested loops?
Next by thread: st: create a variable with estimated coefficients on dummies
Index(es):
- Date
- Thread