Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: indicator variables from -by-
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: RE: indicator variables from -by-
Date: Mon, 26 Aug 2013 17:37:28 -0400

Thanks, Joe.
I understand the concern, but it is hard to imagine that any byable
command has to -if- over all groups because the -in- trick cannot be
implemented. Again, that would be infeasible when by'ing over many
groups.
I suspect that something else might be the key because even the
documentation mentions when introducing the new _by functions and
macros that:
So let’s consider the problems one at a time, beginning with the
second problem. Your program does not use marksample, and we will
assume that your program has good reason for not doing so, because the
easy fix would be to use marksample. Still, your program must somehow
be determining which observations to use, and we will assume that you
are creating a `touse' temporary variable containing 0 if the
observation is to be omitted from the analysis and 1 if it is to be
used. Somewhere, early in your program, you are setting the `touse'
variable.
But of course, if I have `touse', the whole dummy-generation problem
comes back, and it is not easy to use _byn1() and _byn2()…
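
For concreteness, here is a minimal sketch of the -in- idea with no `touse' variable at all. It is only an illustration under stated assumptions: the program name mymns_in is made up, the program accepts no -if- or -in-, and it is meant to be called only under -by- (I have not checked what _byn1() and _byn2() return otherwise).

```stata
* Hypothetical variant; the name mymns_in is illustrative only.
* Assumes it is always called under -by- (so data are sorted by group).
capture program drop mymns_in
program define mymns_in, byable(recall, noheader)
    * no [if] [in] and no marksample, so no `touse' tempvar is created
    syntax varname(numeric)
    * scan only the observation range of the current by-group
    summarize `varlist' in `=_byn1()'/`=_byn2()', meanonly
    * stack this group's mean onto matrix A
    matrix A = nullmat(A) \ r(mean)
end

* usage on a shipped example dataset
sysuse auto, clear
capture matrix drop A
bysort foreign: mymns_in price
matrix list A
```

The design choice is to give up -if- and -in- entirely, so the only restriction is the by-group's own observation range.
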
On Mon, Aug 26, 2013 at 4:03 PM, Joe Canner <[email protected]> wrote:
> I'm not real familiar with -byable-, but there is some interesting information on it in the PDF documentation (p.pdf, page 8). In particular, there are built-in functions _byn1() and _byn2() which return the first and last observation number of the current by-group. Thus, it is up to the -byable- program to make use of this information for efficiency purposes. Otherwise, if you use `touse' indicators you are stuck with using -if- to identify by-group members.
>
> So, presumably your wrapper could look something like this:
>
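> * summarize only the observation range of the current by-group
> * (any if/in from the caller is parsed by -syntax- but not used here)
> prog mymns, byable(recall, noheader)
>     syntax [varlist] [if] [in]
>     sum `varlist' in `=_byn1()'/`=_byn2()', mean
>     mat A=nullmat(A)\r(mean)
> end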
>
> Keep in mind, however, that if the program is called with -if- or -in-, it will still have to deal with those as well, using -marksample-. So, if you want the wrapper program to be as efficient as possible, it may be better to prohibit -if- and -in-, or else have the program handle those calls separately.
>
> Regards,
> Joe
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of László Sándor
> Sent: Monday, August 26, 2013 2:45 PM
> To: [email protected]
> Subject: Re: st: RE: indicator variables from -by-
>
> Yes, this is true, but bysort'ing my (Austin's) ado wrapper for the
> (built-in) summarize to save the result should do the same thing. Or do you mean there are no `touse' indicators involved? If built-in commands do -by- differently, then perhaps yes. But the -byable- documentation suggests ado-files do use `touse' indicators. Maybe not a new one for each category, but a single one combined with clever in'ing?
> Probably.
>
> All the more so, then: this cannot justify the order of magnitude slowdown and running out of 220 GB free memory…
>
> On Mon, Aug 26, 2013 at 1:08 PM, Joe Canner <[email protected]> wrote:
>> Laszlo,
>>
>> My guess is that -bys- takes good advantage of the sorting. In fact, you are not allowed to run -by- without -sort-, probably because doing so would ruin the optimization.
>>
>> To illustrate, try the following:
>>
>> gen obs=_n
>> sum AGE if inrange(obs,1000000,2000000)
>>
>> and
>>
>> sum AGE in 1000000/2000000
>>
>> In my test (with a dataset of almost 8 million observations), the
>> former (not including -gen-) took 20x longer than the latter.
>> Similarly, the -bys- code presumably accesses all observations in a
>> particular level of the by variable more-or-less by observation
>> number, rather than by -if- testing. (I think Nick Cox alluded to this
>> a while back.)
>>
>> Regards,
>> Joe Canner
>> Johns Hopkins School of Medicine
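
A self-contained way to try this comparison, as a rough sketch on synthetic data: the 8,000,000 observations and the uniform random AGE are assumptions made for illustration, not Joe's dataset, and the timings will vary by machine.

```stata
* Synthetic stand-in for the real data (sizes are assumptions).
clear
set obs 8000000
gen long obs = _n
gen double AGE = runiform()

timer clear
* timer 1: -if- has to evaluate the condition on every observation
timer on 1
sum AGE if inrange(obs, 1000000, 2000000), meanonly
timer off 1
* timer 2: -in- visits only the requested observation range
timer on 2
sum AGE in 1000000/2000000, meanonly
timer off 2
* report elapsed seconds for both timers
timer list
```

Both commands summarize the same 1,000,001 observations; only the way the range is selected differs.
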
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of László
>> Sándor
>> Sent: Sunday, August 25, 2013 11:55 AM
>> To: [email protected]
>> Subject: st: indicator variables from -by-
>>
>> Hi,
>> I have so many observations that even the byte tempvars of
>> -marksample- might make me run out of memory.
>>
>> But -by- must be efficient about this: if you -bys- over many groups (e.g. households), you never run out of memory because a new touse tempvar was created for each group.
>>
>> Thus I don't understand why this wrapper for -sum, meanonly- (just to collect saved results lost otherwise) runs out of copious amounts of memory (bying over 20 groups) while the -bys: sum, meanonly- is still much, much faster than any tabbing or tabstating or statsbying or Mata alternative. What does -by- handle differently about the latter that it cannot do with the former?
>>
>> prog mymns, byable(recall, noheader)
>>     syntax [varlist] [if] [in]
>>     marksample touse
>>     sum `varlist' if `touse', mean
>>     mat A=nullmat(A)\r(mean)
>> end
>>
>> Thanks,
>>
>> Laszlo
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/