Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?
From: Eric Booth <[email protected]>
To: "<[email protected]>" <[email protected]>
Subject: Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?
Date: Sat, 10 Jul 2010 01:53:36 +0000
Hi Tony:
I have rmsg on permanently, so in that sense I find it simpler to use. I also like that, in addition to showing how long each command (and an entire do-file) takes to run, rmsg displays the timestamp at which each command completes, which can be useful when running something overnight.
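A minimal sketch of switching rmsg on for a session (the auto dataset and the -summarize- call are only placeholder choices for illustration):

```stata
* Sketch only: -sysuse auto- and -summarize- are placeholder choices
sysuse auto, clear
set rmsg on
* each command is now followed by a line of the form "r; t=<seconds> <time of day>"
summarize price
* use -set rmsg on, permanently- instead to keep the setting across sessions
set rmsg off
```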
That said, rmsg is often better as a supplement to the other timing commands than as a substitute for them. In the example in my previous post, rmsg is a simple way to get the run time of the -collapse- and -tabout- commands, since I was interested in those commands only. If I were interested in the time to run a group or section of commands, though, I would be stuck adding up the rmsg times (or subtracting the timestamps), which could be a pain if I had to scroll through the Results window or a log file to find them. In that case it is useful to use -timer- and then add -timer list- to the end of the do-file to get a report on how long each section of interest took to run. You could also write in some quick comparisons of the run times of sections of code using the values that -timer list- stores (e.g. di `r(t1)'/`r(t2)' ).
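A minimal sketch of that -timer- workflow, assuming the small auto dataset, timer numbers 1 and 2, and -tabstat- as the second command (all illustrative choices, not the benchmark from my earlier post):

```stata
* Sketch only: dataset, variables, and timer numbers are illustrative assumptions
clear all
sysuse auto

timer clear

* section 1: -collapse-, wrapped in -preserve-/-restore- so the full data survive
* (the timing for section 1 therefore includes the -preserve-/-restore- overhead)
timer on 1
preserve
collapse (sum) price mpg weight, by(foreign)
restore
timer off 1

* section 2: the same sums from -tabstat-, run on the full data
timer on 2
tabstat price mpg weight, by(foreign) statistics(sum)
timer off 2

* -timer list- reports each timer and stores r(t1), r(t2), r(nt1), r(nt2)
timer list
* ratio of the two section times (shows "." if section 2 registered 0 seconds)
display "section 1 / section 2 = " r(t1)/r(t2)
```

On a dataset this small both timings will be close to zero, so the ratio only becomes informative on data large enough for the sections to take measurable time.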
Finally, if you are interested in how long the components of a program take to run, you can use -profiler- to get a more detailed look. For instance, turning on the profiler before running -collapse- and -tabout- gives output like:
. profiler report
collapse
     7      0.201  collapse
     7      0.001  GetOpStat
     7      0.002  GetVarlist
    14      0.000  Setnf
     7      0.002  bynottar
    14      0.347  _sum
            0.553  Total
tabout
     7      0.052  tabout
     6      0.050  sum_oneway
   264      0.064  do_statres
     6      0.033  sum_write
     6      0.001  clearglobs
            0.200  Total
Overall total count = 359
Overall total time = 0.753 (sec)
r; t=0.00 20:42:39
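A minimal sketch of how a report like this can be produced (the auto dataset and the -collapse- call are placeholders for whatever commands you want to profile):

```stata
* Sketch only: the dataset and the profiled command are illustrative assumptions
clear all
sysuse auto

* reset any earlier profiling results, then start recording
profiler clear
profiler on

* the ado-file command(s) of interest go here
collapse (mean) price mpg, by(foreign)

* stop recording and print the per-program call counts and times
profiler off
profiler report
```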
~ Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
Fax: +979.845.0249
http://ppri.tamu.edu
On Jul 9, 2010, at 9:44 AM, Lachenbruch, Peter wrote:
> A quick question related to this: I note that many use the timer function to get timings. I have sometimes used rmsg (set rmsg on) which gives the timing after each command. Would this be simpler?
> Tony
>
> ________________________________________
> From: [email protected] [[email protected]] On Behalf Of Eric Booth [[email protected]]
> Sent: Thursday, July 08, 2010 5:36 PM
> To: <[email protected]>
> Subject: Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?
>
>
> Tiago:
>
> When summarizing a large dataset, I've found the program that runs the fastest for me is -tabout- (from SSC).
> I don't know enough about what's going on in the tabout ado-file to know why it's faster, and it may not be faster for all types of summary tables, but when I changed from -collapse-/-contract- to -tabout- in my do-file there was a huge time savings when working with a dataset of about 60 million obs.
>
> For an illustration, here's a speed comparison for creating the same summary table with these 2 packages:
>
> ******************!
> clear all
> ** | change -set mem- and -expand- below to fit your system | **
> set mem 12g
> sysuse auto
> cap which tabout
> if _rc ssc install tabout
>
> **create a large dataset**
> expand 950000
> desc, sh
> recode rep78 (.=9)
>
> **test collapse vs. tabout**
>
> // 1. collapse
> ds make rep78, not
> local vars `r(varlist)'
> **
> timer clear 1
> timer on 1
> collapse (sum) `vars' , by(rep78)
> timer off 1
> save master, replace
>
> // 2. tabout (note: -collapse- above replaced the data in memory, so this step runs on the collapsed dataset)
> local vars: subinstr local vars " " " sum ", all
> di "`vars'"
> **
> timer clear 2
> timer on 2
> tabout rep78 using test.xls, replace sum c(sum `vars')
> timer off 2
>
> **make sure these are creating the same summary tables**
> cf _all using master.dta, verbose all
> **
> timer list
> ******************!
>
> /*
> timer list
> 1: 240.41 / 1 = 240.4130
> 2: 0.43 / 1 = 0.4340
> */
>
> About 4 minutes for -collapse- versus less than a second for the -tabout- summary table (using Stata 11.1 MP on Mac OS X).
> Good luck.
>
> ~ Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
> Office: +979.845.6754
>
>
>
> On Jul 8, 2010, at 9:02 AM, Tiago V. Pereira wrote:
>
>> Dear Statalister,
>>
>> I am eager to know any faster alternatives to -collapse-, because I have
>> to summarize relatively large data sets for a simulation study. -profiler-
>> is telling me that most of the computation burden comes from -collapse-.
>> Do you know (have) any faster alternative? Perhaps a plug-in?
>>
>> Thanks!
>>
>> Tiago
>>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/