Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From
László Sándor <[email protected]>
To
[email protected]
Subject
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date
Mon, 19 Aug 2013 13:30:11 -0400
Sorry, Nick.
For future reference: I want to plot binned means. Bins are usually
deciles of the x variable. So currently I take -tab bins, sum(xvar)-
to get 10 means as x coordinates, and then -tab bins, sum(yvar)- to
get 10 means as y coordinates.
The bins are redefined every time you pick a new x or you redefine the
sample, so I cannot just collapse on the bins once and for all.
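Concretely, the kind of thing I do now is roughly the following (names
are placeholders; the bins might come from -xtile- or from a rougher
approximation):

xtile bins = xvar, nquantiles(10)       // decile bins of the current x
tabulate bins, summarize(xvar) means    // 10 bin means of x, read off the output
tabulate bins, summarize(yvar) means    // 10 bin means of y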
The question now was whether -tab bins, sum(xvar)- can be replaced with
something faster that neither destroys the data nor requires a
-preserve-/-restore- round trip.
One option would indeed be -bys bins: sum y x, meanonly-, but again, I
am hesitant to believe that sorting all the data just to get the bin
means is worth it. (Granted, if precise deciles are calculated first,
the data end up sorted by xvar, and if we manage to preserve that sort
order they are sorted by bins too; however, sometimes I take
rough/stochastic quantiles without a full sort.)
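For concreteness, the sort-based route would be roughly what the core
of -egen, mean()- does (placeholder names again):

sort bins
by bins: generate double mx = sum(xvar)/sum(xvar < .)
by bins: replace mx = mx[_N]    // per-bin mean of x, repeated within the bin
by bins: generate double my = sum(yvar)/sum(yvar < .)
by bins: replace my = my[_N]    // per-bin mean of y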
The other option seemed to be to keep track of the levels of "bins"
and simply -forvalues- loop over them, -if-ing in one bin at a time to
grab the means quickly. This was surprisingly fast, and it does not
seem any slower without a sort beforehand. Again, I am not sure that
whatever efficiency -by:- offers over looping with -if- is worth the
cost of the initial sort.
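For the archives, here is a sketch of the kind of loop I mean,
assuming ten bins coded 1 to 10 (names are placeholders):

local pts
forvalues i = 1/10 {
    summarize xvar if bins == `i', meanonly
    local mx = r(mean)
    summarize yvar if bins == `i', meanonly
    local pts `pts' `r(mean)' `mx'    // -scatteri- expects y x pairs
}
twoway scatteri `pts'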
Thanks, everyone,
Laszlo
On Mon, Aug 19, 2013 at 1:06 PM, Nick Cox <[email protected]> wrote:
> I believe whatever you say about times to read in the data again, and so forth.
>
> But if you -sort- first and then follow with lots of statements
> including -if- there is little or no gain. Stata still has to loop
> through all the observations and test every one using the -if-
> condition. The point about -sort-ing here is that calculations with
> -by:- are then faster (and indeed possible at all). That's what the
> -egen- code exploits.
>
> But I am very possibly very confused about what you are calculating.
> This thread started with reference to -tabulate, summarize- but now
> (somehow) you are talking about deciles. Means and standard deviations
> don't require prior sorting, but deciles do.
>
> I think I'd better bail out now, just in case I am adding more
> confusion to a convoluted thread.
>
> Nick
> [email protected]
>
>
> On 19 August 2013 17:56, László Sándor <[email protected]> wrote:
>> Thanks, Nick.
>>
>> 1. I will experiment with looping over 10-20 -if- conditions, without
>> the sorting. I cannot imagine sorted data speeding anything up enough
>> to justify the upfront time cost of the sort.
>>
>> 2. -collapse- is a no-go. I need the original data, as I am producing
>> many graphs, one after another, with different outcomes, different
>> sample restrictions etc. So no, I cannot lose (or even preserve) it
>> just to load the whole thing back in, which takes minutes if not more
>> with our current drives.
>>
>> And by the way, though our tests have not been extensive, the -if- loop
>> was not much slower than -collapse, fast-, strangely enough.
>>
>> Thanks again,
>>
>> Laszlo
>>
>> On Mon, Aug 19, 2013 at 12:44 PM, Nick Cox <[email protected]> wrote:
>>> Clearly the rest of us don't have your data and can't experiment.
>>> (This is not a veiled request to send me those data, thanks!)
>>>
>>> Also, the interest of this for others I suggest lies solely in your
>>> problem, generalised, and so anything that is quirky about what you
>>> want is fine by me (us?) but not compelling for anybody else. I am
>>> concerned with general strategy as I understand it.
>>>
>>> You make two points here.
>>>
>>> 1. You would rather not -sort- if you can avoid it. Well, I think we
>>> all agree with that. But I've learned not to avoid it for many
>>> problems.
>>>
>>> 2. All you want to plot are ten deciles. You probably mentioned that
>>> several threads ago, or earlier in this thread. I agree that makes the
>>> graphical problem easier. (It seems a pity that with so much data you
>>> don't plot a lot more detail!) But if the main purpose of the
>>> calculation is to get a reduced dataset for graphing, -collapse- seems
>>> to re-enter the discussion. (Underneath the hood -graph- does an awful
>>> lot of -collapse-s.)
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 19 August 2013 17:13, László Sándor <[email protected]> wrote:
>>>> Nick,
>>>>
>>>> I hope this helps to keep (or make?) this a fruitful discussion:
>>>>
>>>> I have tens of millions of observations, or more. E.g. all taxpayers
>>>> over many years. I would rather not sort them. Doesn't sorting scale
>>>> worse than -if- checks? I always have only a few bins, so I never loop
>>>> over more than a few dozen bin values… But I would always have to
>>>> sort all observations. The sorting variable taking only a limited
>>>> number of values does not matter that much here, does it?
>>>>
>>>> And this also comes back to -scatteri-. I cannot plot (say) 10 points
>>>> for the ten deciles without having variables for the deciles, which is
>>>> scary. (No, I don't want to associate the second decile with my second
>>>> observation, not even in a tempvar — and even generating tempvars and
>>>> plotting them with many tens of millions of missing values is very slow).
>>>> Why not plot the ten deciles directly?
>>>>
>>>> On Mon, Aug 19, 2013 at 12:01 PM, Nick Cox <[email protected]> wrote:
>>>>> These reservations don't change my advice.
>>>>>
>>>>> 1. For graphics it is easiest by far to have variables to graph. I use
>>>>> -scatteri- for special effects, but you'd have to factor in
>>>>> programming time to get that right.
>>>>>
>>>>> 2. Trying to avoid a -sort- strikes me as false economy here. -by:-
>>>>> goes hand in hand with a -sort- and the alternative of some kind of
>>>>> loop and -if- can't compete well, in my experience.
>>>>>
>>>>> Nick
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On 19 August 2013 16:39, László Sándor <[email protected]> wrote:
>>>>>> Thanks, Nick.
>>>>>>
>>>>>> I would much prefer not to have a new dataset. I want to plot (and
>>>>>> maybe log) the binned means, and to use the immediate -scatteri-, I
>>>>>> need no new variables but probably a few macros with the means stored
>>>>>> in them.
>>>>>>
>>>>>> I am not sure Mata would be faster than the looping tests we ran with
>>>>>> the already optimised -sum, meanonly-; they were surprisingly fast.
>>>>>> No, I haven't tried either.
>>>>>>
>>>>>> -_gmean.ado- does start with a -sort-, which is prohibitive when big
>>>>>> data is the primary use case.
>>>>>>
>>>>>> On Mon, Aug 19, 2013 at 11:32 AM, Nick Cox <[email protected]> wrote:
>>>>>>> I got lost somewhere early on in this thread, but I don't see mention
>>>>>>> towards the end of some other possibilities.
>>>>>>>
>>>>>>> If I understand it correctly, László wants a fast implementation
>>>>>>> equivalent to -tabulate, summarize- in place, i.e. without replacing
>>>>>>> the dataset by a reduced dataset, but with the results saved for
>>>>>>> later use.
>>>>>>>
>>>>>>> As earlier emphasised, he can't borrow or steal code from -tabulate,
>>>>>>> summarize- because that is compiled C code invisible to anyone except
>>>>>>> Stata developers (and they aren't about to show you (and you almost
>>>>>>> certainly wouldn't benefit from the code anyway, as it probably
>>>>>>> depends on lots of other C code (if it's typical of C code of this
>>>>>>> kind (or at least that's my guess)))).
>>>>>>>
>>>>>>> To that end, my thoughts are
>>>>>>>
>>>>>>> 1. Never use reduction commands if you don't want data reduction,
>>>>>>> unless you can be sure that the overhead of reading the data in again
>>>>>>> can be ignored and you can't think of a better method.
>>>>>>>
>>>>>>> 2. The possibility of using Mata was not mentioned (but no, I don't
>>>>>>> have code up my sleeve).
>>>>>>>
>>>>>>> 3. Although -egen- itself is slow, the code at the heart of _gmean.ado
>>>>>>> and _gsd.ado is where I would start. That uses -by:- and avoids a loop,
>>>>>>> and although there are a few lines to be interpreted, I would expect it
>>>>>>> to be pretty fast.
>>>>>>>
>>>>>>> Nick
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>> On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
>>>>>>>> Thanks, all.
>>>>>>>>
>>>>>>>> I am still confused about how I could get the speed of some of the
>>>>>>>> methods, like -collapse-, without losing my data, which usually runs
>>>>>>>> to dozens of GB.
>>>>>>>>
>>>>>>>> Otherwise I think we are only talking about -tabulate- versus -table-,
>>>>>>>> but both need log-parsing, or some -bys: summarize- with collection of
>>>>>>>> the results into locals, which I did not attempt.
>>>>>>>>
>>>>>>>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>>>>>>>> of the many lines of -summarize, meanonly-, especially as it runs over
>>>>>>>> the dataset many times, just -if-ing in different observations each time.
>>>>>>>>
>>>>>>>> On an 8-core StataMP 13 for Linux,
>>>>>>>> full -tabulate, sum- itself took ~140s;
>>>>>>>> -tab, matcell- took <5s, but indeed generates frequencies only;
>>>>>>>> a second -tabulate, sum-, even with -nofreq- and -nostandard-, took
>>>>>>>> about the same, as did the run through the -caplog- wrapper;
>>>>>>>> a -collapse, fast- took 36s, but of course this loses the data;
>>>>>>>> the -summarize- loop took 92s without the postfiles, 34.53s with —
>>>>>>>> but I still cannot -scatteri- the results in the same Stata instance…
>>>>>>>>
>>>>>>>> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>>>>>>>> full -tabulate, sum- itself took ~195s
>>>>>>>> -tab, matcell-: 8s
>>>>>>>> again the same speed for -tab, sum- without frequencies and standard
>>>>>>>> deviations, or with the wrapper
>>>>>>>> -collapse- took 60s
>>>>>>>> the loops of -summarize- took 160s now without the postfiles, 47s with.
>>>>>>>>
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/faqs/resources/statalist-faq/
* http://www.ats.ucla.edu/stat/stata/