Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
RE: st: RE: indicator variables from -by-
From: Joe Canner <[email protected]>
To: "[email protected]" <[email protected]>
Subject: RE: st: RE: indicator variables from -by-
Date: Wed, 28 Aug 2013 20:22:09 +0000
I'm puzzled by the memory issue. I am using Stata 12 and don't pay much attention to memory allocation except when too many people are doing too many things, but even when I put a -memory- command at various locations in the wrapper I see very little extra memory usage above and beyond the actual data set. Even on a theoretical level, I can't see why it would need considerable amounts of memory just to keep track of a few -byable- housekeeping variables/macros and to compute the mean. Can you send some details regarding memory usage?
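One way to gather such details is to call -memory- immediately before and after the by-run. The following is only a sketch under stated assumptions: it needs Stata 12 or later (where -memory- reports usage), it builds toy data in place of the real file (the variable names bins and AGE follow the thread, but the sizes and values are made up), and it redefines the -mymns- wrapper discussed in this thread with -meanonly- added.

```stata
* Sketch only: toy data; sizes and values are illustrative assumptions.
clear
set obs 100000
set seed 12345
gen byte bins = ceil(20*runiform())
gen byte AGE  = floor(90*runiform())

* The wrapper from this thread, using the by-group observation range
capture program drop mymns
program mymns, byable(recall, noheader)
    syntax [varlist] [if]
    summarize `varlist' in `=_byn1()'/`=_byn2()' `if', meanonly
    matrix A = nullmat(A) \ r(mean)
end

capture matrix drop A       // start with an empty results matrix
memory                      // usage before the by-run
bysort bins: mymns AGE
memory                      // usage after the by-run
matrix list A               // one row of means per by-group
```

Comparing the two -memory- reports shows whether the wrapper itself, rather than the dataset, accounts for any growth.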
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of László Sándor
Sent: Wednesday, August 28, 2013 3:53 PM
To: [email protected]
Subject: Re: st: RE: indicator variables from -by-
I have not dug deep, perhaps the byvar was not missing when the if condition was true. Funny that the -if- took precedence, then.
More to the point: the new wrapper still runs out of the copious amounts of memory I gave to the program. I expected that marksample wasn't really the problem. Using mymns with long (-maxlong()-) but narrow data with 1100% of data size available in memory was not enough to complete.
On Wed, Aug 28, 2013 at 3:34 PM, Joe Canner <[email protected]> wrote:
> How did you get it to suppress missing byvar levels using -if-? Under the circumstances, I'm surprised you were able to get that to happen at all.
>
> In any case, the only solution I can think of is to use -byable(onecall)- and control the process yourself so that missing levels aren't used. If your by variable is derived from your analysis variable (i.e., the binned mean problem) you could use the _byindex macro and _byindex() function to see if all of the observations in a by group are missing and suppress execution of -sum- in that case.
>
> It would be nice if -bys- had an option for not including missing levels, or at least it would be nice if -byable- had a way to see the value of the currently executing by group. The latter seems so basic, I feel like it must be there and I am just missing it.
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of László
> Sándor
> Sent: Wednesday, August 28, 2013 1:26 PM
> To: [email protected]
> Subject: Re: st: RE: indicator variables from -by-
>
> That said, it is a bit confusing that -by- has an extra (?) run for the byvar missing also, but not if we also used an if condition. This makes parsing harder.
>
> On Wed, Aug 28, 2013 at 1:17 PM, László Sándor <[email protected]> wrote:
>> Oh, I should have tried, then, Joe, sorry.
>>
>> I thought the byable command would not take both an `in' and an `if'.
>> Sorry, I see -sum `varlist' in `=_byn1()'/`=_byn2()' `if'- works.
>>
>> Surely this is what StataCorp does, then, sorry about the hasty comments.
>>
>> On Wed, Aug 28, 2013 at 10:20 AM, Joe Canner <[email protected]> wrote:
>>> Laszlo,
>>>
>>> I'm not exactly sure what you are asking, but I was able to test my modification of your wrapper on a large data set and confirmed that it is faster. Actually, I modified it some more to allow -if- qualifiers:
>>>
>>> prog mymns, byable(recall, noheader)
>>>   syntax [varlist] [if]
>>>   sum `varlist' in `=_byn1()'/`=_byn2()' `if'
>>>   mat A=nullmat(A)\r(mean)
>>> end
>>>
>>> Benchmark:
>>> bys bins: mymns AGE // t=20.17s
>>> bys bins: mymns AGE if RACE==1 // t=21.45s
>>>
>>> using your original wrapper (with -marksample-)
>>> bys bins: mymns AGE // t=41.11s
>>> bys bins: mymns AGE if RACE==1 // t=41.73s
>>>
>>> Since the vast majority of the time required is taken up with sorting, we are only seeing a 2x improvement in performance. However, if sorting is already done, we would see the 20x improvement seen before.
>>>
>>> Note also that my modified wrapper cannot be used with [in]. If for some strange reason you want to do that as well you can revert to your original method:
>>>
>>> prog mymns, byable(recall, noheader)
>>>   syntax [varlist] [if] [in]
>>>   if ("`in'"!="") {
>>>     marksample touse
>>>     sum `varlist' if `touse'
>>>   }
>>>   else {
>>>     sum `varlist' in `=_byn1()'/`=_byn2()' `if'
>>>   }
>>>   mat A=nullmat(A)\r(mean)
>>> end
>>>
>>> While one can't be sure (without insider information from StataCorp) what -bys- actually does, based on the consistency between these benchmarks and the ones using -bys- directly (no wrapper), my guess is that StataCorp does something similar. In fact, on a whim, I tried the following:
>>>
>>> . bys bins: sum AGE in 1000000/2000000
>>>
>>> It spent the requisite amount of time sorting first and *then*, as it was trying to spit out the results for the first level of "bins" gave the error "`in' may not be combined with `by'", which is the same error I got when I tried to run my original modification to your wrapper. So, apparently StataCorp recognizes that [in] doesn't make sense in this context, or else they don't want it to screw up the performance gains (or both). Interestingly, I could not find anything in the help for -by- that indicates that [in] cannot be used with -by-.
>>>
>>> Regards,
>>> Joe
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of László
>>> Sándor
>>> Sent: Wednesday, August 28, 2013 4:52 AM
>>> To: [email protected]
>>> Subject: Re: st: RE: indicator variables from -by-
>>>
>>> Thanks, Joe.
>>>
>>> I think my trouble is allowing for if conditions for the byable command without killing the purpose of using the functions. Or even simply combining them with the if condition for the command. Or are you saying that if there is an if condition, StataCorp's byable commands themselves also simply loop over their if conditions? That is still hard to believe for use cases like -bys household:- etc., where each if condition selects only a tiny fraction of the data.
>>>
>>> On Tue, Aug 27, 2013 at 6:37 PM, Joe Canner <[email protected]> wrote:
>>>> I think the _byn1() and _byn2() functions (or something equivalent) are what StataCorp is using internally with -bys-. I was able to get these to work just fine (although I didn't try it on a large dataset), so the problem is not that StataCorp is hiding functionality, it is that they don't highlight the fact that you might want to use _byn1() and _byn2() to improve performance.
>>>>
>>>> Let me know if you are having trouble with _byn1() and _byn2().
>>>> ________________________________________
>>>> From: [email protected]
>>>> [[email protected]] on behalf of László Sándor
>>>> [[email protected]]
>>>> Sent: Tuesday, August 27, 2013 5:26 PM
>>>> To: [email protected]
>>>> Subject: Re: st: RE: indicator variables from -by-
>>>>
>>>> I still think it's a different thing from offering (encouraging,
>>>> welcoming) developers to write -byable- commands while having a
>>>> secret, unacknowledged, undocumented way that only StataCorp
>>>> commands tap into. So anything not built-in is missing out, even
>>>> the StataCorp ado files?
>>>>
>>>> I would have thought they would simply have an infrastructure for
>>>> by they use just like us, and if it has to be slower with if
>>>> conditions, so be it.
>>>>
>>>> Or maybe there is no secret functionality of the prefix, only they
>>>> manually add conditions to their commands running bys differently
>>>> if `if'==""? And they don't automate this? Strange.
>>>>
>>>> On Tue, Aug 27, 2013 at 3:21 PM, Joe Canner <[email protected]> wrote:
>>>>> I don't mean to imply that StataCorp was previously uninterested in performance. I get the impression, however, that efforts were more aimed at complex modeling procedures than at run-of-the-mill data processing and basic descriptive statistics. I think historically people have not seen Stata as a tool for analyzing large data sets (rightfully so), but now that memory is available for those datasets, people are starting to catch on, and StataCorp is starting to focus more on this issue.
>>>>>
>>>>> I would also say that perhaps StataCorp was not very good (i.e., overly-modest) about advertising performance improvements. Perhaps this is because they didn't think many people would care, or perhaps the performance improvements were side effects of other changes.
>>>>>
>>>>> From a user education perspective, there is also a deficiency in this regard in the way Stata is taught in schools, at least in my limited experience. Students are taught using very small sample data sets and are thus not taught how to code in an efficient way. When they suddenly come into contact with large data sets, they still use methods that are inefficient but which were fine for small data sets.
>>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected]
>>>>> [mailto:[email protected]] On Behalf Of Nick
>>>>> Cox
>>>>> Sent: Tuesday, August 27, 2013 2:50 PM
>>>>> To: [email protected]
>>>>> Subject: Re: st: RE: indicator variables from -by-
>>>>>
>>>>> Yes and no. StataCorp over 20+ years have devoted enormous efforts to speeding up code! But they add new functionality too. The question is: Where to strike the balance? It's hardly the case that StataCorp are indifferent to speed, but those pesky users keep asking for new modelling commands.
>>>>>
>>>>> Actually, I suspect Joe agrees.
>>>>>
>>>>> I am reminded obliquely of a fraught Library Committee meeting several years ago, in which a rather irritated librarian snapped at the academics: "We could do a really good job of organising the library but people keep coming in and borrowing books!" (Scary thing was that she seemed to mean exactly what she said, that users were a nuisance.)
>>>>> Nick
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On 27 August 2013 19:42, Joe Canner <[email protected]> wrote:
>>>>>> I presented a paper at the recent Stata Conference in New Orleans on optimizing Stata code for speed. Based on the response from Stata, it appears that this hasn't been on their radar until recently. Keep in mind that it is only relatively recently that memory was cheap enough that large Stata data sets could fit in memory and thus spawn issues of performance even for basic tasks. Now that memory availability has caught up with these big datasets, Stata is becoming more concerned about the issue. I suspect you will see more efforts in this direction in future versions.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: [email protected]
>>>>>> [mailto:[email protected]] On Behalf Of László
>>>>>> Sándor
>>>>>> Sent: Tuesday, August 27, 2013 2:35 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: st: RE: indicator variables from -by-
>>>>>>
>>>>>> Thanks, Joe, this was very educational.
>>>>>>
>>>>>> I just wonder why StataCorp doesn't want to or cannot explain to third-party developers how to write byable ado files that exploit the speed of in vs if. I mean that the documentation clearly suggests using -marksample- in byable commands, which, you seem to conclude, itself precludes some optimization otherwise available in the -bys- prefix.
>>>>>>
>>>>>> My original code was slow without any special if condition, only using the recommended marksample in my byable command.
>>>>>>
>>>>>> On Tue, Aug 27, 2013 at 9:43 AM, Joe Canner <[email protected]> wrote:
>>>>>>> Laszlo,
>>>>>>>
>>>>>>> First, I don't think we can assume that the built-in -bysort- prefix uses -marksample- and `touse' macros. No doubt they have achieved some efficiencies that user-written -byable- programs cannot.
>>>>>>>
>>>>>>> That said, it does appear that adding an -if- qualifier to a -bys:- command slows down performance. Take the example I provided yesterday; if you add an equivalent -if- qualifier to each:
>>>>>>>
>>>>>>> . sum AGE if inrange(obs,1000000,2000000) & RACE==1
>>>>>>> versus
>>>>>>> . sum AGE in 1000000/2000000 if RACE==1
>>>>>>>
>>>>>>> The latter is still faster than the former but only by a factor of 6 or 7, rather than 20. Consider also the following:
>>>>>>>
>>>>>>> . bys bins: sum AGE
>>>>>>> versus
>>>>>>> . bys bins: sum AGE if RACE==1
>>>>>>>
>>>>>>> For my 8 million record dataset the latter took 18 seconds and the former took 17 seconds, despite the fact that the latter involves a 65% subset of the population.
>>>>>>>
>>>>>>> So, it is clear that Stata has written -bys- to be optimized for the case where there are no qualifiers, presumably because it can take advantage of the sorting.
>>>>>>>
>>>>>>> Incidentally, I also tried the following comparison:
>>>>>>> . bys bins RACE: sum AGE
>>>>>>> versus
>>>>>>> . bys bins: sum AGE if RACE==1
>>>>>>>
>>>>>>> The former took only 19.4 seconds, compared to 18.5 for the latter, despite producing six times as much output. In other words, if you plan to run a command more than once with several different -if- qualifiers, looking at different levels of the same variable, you might as well put that variable in the -bys- varlist.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Joe
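The comparison between adding a by-variable and adding an -if- qualifier can be reproduced roughly as follows. This is a sketch, not the original benchmark: the data are simulated, the row count is smaller than the 8 million records mentioned, six RACE levels are assumed from the "six times as much output" remark, and the helper variable u is hypothetical, used only to put the data back in random order so that each run pays for its own sort.

```stata
* Sketch only: simulated data; names bins, RACE, AGE follow the thread.
clear
set obs 1000000
set seed 12345
gen byte bins = ceil(20*runiform())
gen byte RACE = ceil(6*runiform())     // assumption: six levels
gen byte AGE  = floor(90*runiform())
gen double u  = runiform()             // hypothetical shuffle key

timer clear

sort u                                 // random order before run 1
timer on 1
quietly bysort bins RACE: summarize AGE
timer off 1

sort u                                 // random order before run 2
timer on 2
quietly bysort bins: summarize AGE if RACE==1
timer off 2

timer list   // timer 1: extra by-variable; timer 2: -if- qualifier
```

Both timings include the sort, as in the figures quoted above; dropping the two -sort u- lines would instead show the case where the data are already sorted.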
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: [email protected]
>>>>>>> [mailto:[email protected]] On Behalf Of
>>>>>>> László Sándor
>>>>>>> Sent: Monday, August 26, 2013 5:37 PM
>>>>>>> To: [email protected]
>>>>>>> Subject: Re: st: RE: indicator variables from -by-
>>>>>>>
>>>>>>> Thanks, Joe.
>>>>>>>
>>>>>>> I understand the concern, but it is hard to imagine that any byable command if's over all groups because the in-trick cannot be implemented. Again, this would be infeasible for by'ing over many groups.
>>>>>>>
>>>>>>> I suspect that something else might be the key because even the documentation mentions when introducing the new _by functions and macros that:
>>>>>>>
>>>>>>> So let's consider the problems one at a time, beginning with the second problem. Your program does not use marksample, and we will assume that your program has good reason for not doing so, because the easy fix would be to use marksample. Still, your program must somehow be determining which observations to use, and we will assume that you are creating a 'touse' temporary variable containing 0 if the observation is to be omitted from the analysis and 1 if it is to be used. Somewhere, early in your program, you are setting the 'touse' variable.
>>>>>>>
>>>>>>> But of course, if I have `touse', the whole dummy-generation
>>>>>>> problem comes back, plus it is not easy to use _byn1() _byn2().
>>>>>>>
>>>>>>> On Mon, Aug 26, 2013 at 4:03 PM, Joe Canner <[email protected]> wrote:
>>>>>>>> I'm not real familiar with -byable-, but there is some interesting information on it in the PDF documentation (p.pdf, page 8). In particular, there are built-in functions _byn1() and _byn2() which return the first and last observation number of the current by-group. Thus, it is up to the -byable- program to make use of this information for efficiency purposes. Otherwise, if you use `touse' indicators you are stuck with using -if- to identify by-group members.
>>>>>>>>
>>>>>>>> So, presumably your wrapper could look something like this:
>>>>>>>>
>>>>>>>> prog mymns, byable(recall, noheader)
>>>>>>>>   syntax [varlist] [if] [in]
>>>>>>>>   sum `varlist' in `=_byn1()'/`=_byn2()', mean
>>>>>>>>   mat A=nullmat(A)\r(mean)
>>>>>>>> end
>>>>>>>>
>>>>>>>> Keep in mind however, that if the program is called with -if- or -in-, the program will still have to deal with that as well using -marksample-. So, if you want the wrapper program to be as efficient as possible, it may be better to prohibit using -if- and -in-, or else have the program deal with those calls separately.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: [email protected]
>>>>>>>> [mailto:[email protected]] On Behalf Of
>>>>>>>> László Sándor
>>>>>>>> Sent: Monday, August 26, 2013 2:45 PM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: st: RE: indicator variables from -by-
>>>>>>>>
>>>>>>>> Yes, this is true, but bysort'ing my (Austin's) ado wrapper for the (built-in) summarize to save the result should do the same thing. Or you mean there are no `touse' indicators involved? If built-in commands do by differently, then perhaps yes. But the -byable- documentation suggests ado files do use `touse' indicators. Maybe not a new one for each category but one and then use clever in'ing? Probably.
>>>>>>>>
>>>>>>>> All the more so, then: this cannot justify the order of
>>>>>>>> magnitude slowdown and running out of 220 GB free memory.
>>>>>>>>
>>>>>>>> On Mon, Aug 26, 2013 at 1:08 PM, Joe Canner <[email protected]> wrote:
>>>>>>>>> Laszlo,
>>>>>>>>>
>>>>>>>>> My guess is that -bys- takes good advantage of the sorting. In fact, you are not allowed to run -by- without -sort-, probably because doing so would ruin the optimization.
>>>>>>>>>
>>>>>>>>> To illustrate, try the following:
>>>>>>>>>
>>>>>>>>> gen obs=_n
>>>>>>>>> sum AGE if inrange(obs,1000000,2000000)
>>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>> sum AGE in 1000000/2000000
>>>>>>>>>
>>>>>>>>> In my test (with a dataset of almost 8 million observations), the former (not including -gen-) took 20x longer than the latter. Similarly, the -bys- code presumably accesses all observations in a particular level of the by variable more-or-less by observation number, rather than by -if- testing. (I think Nick Cox alluded to this a while back.)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Joe Canner
>>>>>>>>> Johns Hopkins School of Medicine
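The -if- versus -in- comparison described in this message can be timed with Stata's -timer- commands. This is a minimal sketch with simulated data standing in for the real file; the variable names AGE and obs and the observation range are taken from the message, while the row count and values are assumptions. Timings will differ by machine, so no expected ratio is shown.

```stata
* Sketch only: toy data standing in for the ~8-million-row dataset.
clear
set obs 8000000
set seed 12345
gen byte AGE = floor(90*runiform())
gen long obs = _n                      // created outside the timed region

timer clear

timer on 1
quietly summarize AGE if inrange(obs, 1000000, 2000000)
timer off 1

timer on 2
quietly summarize AGE in 1000000/2000000
timer off 2

timer list   // timer 1: -if- scans every row; timer 2: -in- reads only the range
```

The -if- version must evaluate the condition for every observation, whereas -in- restricts the command to the requested range of observation numbers, which is the property that _byn1() and _byn2() expose to -byable- programs.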
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: [email protected]
>>>>>>>>> [mailto:[email protected]] On Behalf Of
>>>>>>>>> László Sándor
>>>>>>>>> Sent: Sunday, August 25, 2013 11:55 AM
>>>>>>>>> To: [email protected]
>>>>>>>>> Subject: st: indicator variables from -by-
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I have so many observations that even the byte tempvars of
>>>>>>>>> -marksample- might make me run out of memory.
>>>>>>>>>
>>>>>>>>> But -by- must be efficient in this respect: if you -bys- over many groups (e.g. households), you never run out of memory from a new touse tempvar being created for each group.
>>>>>>>>>
>>>>>>>>> Thus I don't understand why this wrapper for -sum, meanonly- (just to collect saved results lost otherwise) runs out of copious amounts of memory (bying over 20 groups) while -bys: sum, meanonly- is still much, much faster than any tabbing or tabstating or statsbying or Mata alternative. What does -by- handle differently with the latter that it cannot do with the former?
>>>>>>>>>
>>>>>>>>> prog mymns, byable(recall, noheader)
>>>>>>>>>   syntax [varlist] [if] [in]
>>>>>>>>>   marksample touse
>>>>>>>>>   sum `varlist' if `touse', mean
>>>>>>>>>   mat A=nullmat(A)\r(mean)
>>>>>>>>> end
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Laszlo
>>>>>>>>> *
>>>>>>>>> * For searches and help try:
>>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>> * http://www.ats.ucla.edu/stat/stata/