Re: st: RE: indicator variables from -by-

From	László Sándor <[email protected]>
To	[email protected]
Subject	Re: st: RE: indicator variables from -by-
Date	Mon, 26 Aug 2013 14:45:26 -0400

Yes, this is true, but -bysort-ing my (Austin's) ado wrapper for the
(built-in) -summarize- to save the results should do the same thing. Or
do you mean that no `touse' indicators are involved at all? If built-in
commands handle -by- differently, then perhaps yes. But the -byable-
documentation suggests that ado-files do use `touse' indicators. Maybe not
a new one for each category, but a single one combined with clever use
of -in-? Probably.

All the more so, then: this cannot justify the order-of-magnitude
slowdown and running out of 220 GB of free memory…
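
For what it is worth, the "clever use of -in-" idea can be sketched by hand. The following is only a minimal illustration under assumptions: the auto dataset, the variables price and foreign, and the matrix name A are stand-ins chosen for the example, and nothing in this thread confirms that -by- works this way internally. It walks through sorted data group by group and restricts -summarize- with an observation range, so no `touse' variable is ever created.

```stata
* Sketch only: sysuse auto, price, foreign, and the matrix name A are
* illustrative choices, not part of the original wrapper.
sysuse auto, clear
sort foreign                      // groups must be contiguous
capture matrix drop A             // start from an empty result matrix

local start = 1
while `start' <= _N {
    * size of the group that begins at observation `start'
    quietly count if foreign == foreign[`start'] in `start'/l
    local stop = `start' + r(N) - 1

    * restrict by observation range, so no touse variable is created
    summarize price in `start'/`stop', meanonly
    matrix A = nullmat(A) \ r(mean)

    local start = `stop' + 1
}
matrix list A
```

The -count- still uses an -if- test, but only from the current position onward and only once per group, which should be cheap when there are about 20 groups.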

On Mon, Aug 26, 2013 at 1:08 PM, Joe Canner <[email protected]> wrote:
> Laszlo,
>
> My guess is that -bys- takes good advantage of the sorting. In fact, -by- will not run unless the data are already sorted by the by-variables, probably because unsorted data would ruin the optimization.
>
> To illustrate, try the following:
>
> gen obs=_n
> sum AGE if inrange(obs,1000000,2000000)
>
> and
>
> sum AGE in 1000000/2000000
>
> In my test (with a dataset of almost 8 million observations), the former (not including the -gen-) took 20x longer than the latter. Similarly, the -bys- code presumably accesses all observations in a particular level of the by-variable more or less by observation number, rather than by -if- testing. (I think Nick Cox alluded to this a while back.)
>
> Regards,
> Joe Canner
> Johns Hopkins School of Medicine
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of László Sándor
> Sent: Sunday, August 25, 2013 11:55 AM
> To: [email protected]
> Subject: st: indicator variables from -by-
>
> Hi,
> I have so many observations that even the byte tempvars of
> -marksample- might make me run out of memory.
>
> But -by- must be efficient about this: if you -bys- over many groups (e.g. households), you never run out of memory on account of a new touse tempvar being created for each group.
>
> Thus I don't understand why this wrapper for -sum, meanonly- (written just to collect saved results that are otherwise lost) runs out of copious amounts of memory (by-ing over 20 groups), while -bys: sum, meanonly- is still much, much faster than any alternative using -tab-, -tabstat-, -statsby-, or Mata. What does -by- handle differently for the latter that it cannot do for the former?
>
> prog mymns, byable(recall, noheader)
>  syntax [varlist] [if] [in]
>  marksample touse
>  sum `varlist' if `touse', mean
>  mat A=nullmat(A)\r(mean)
> end
>
> Thanks,
>
> Laszlo
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
