Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Thu, 22 Aug 2013 06:43:29 -0400

For those out there who care:
I wonder why these timings are not more stable. I am confident that I
am now using all 64 cores on a node of a cluster with Stata/MP 13, with
plenty of RAM. I generate the 8 byte variables, each taking 20 values,
with maxlong() observations. I use the same random sorting before running
any of these methods, and I try the sequence twice independently.
Now -tab, sum()- got slower: it took roughly 36 minutes all three
times I tried.
-collapse, fast- took no more than 20-23 minutes.
The -if- loops took around 90-100 minutes.
The -bys- loop took only 20 minutes once, then LESS THAN 4.5 minutes
twice (with unsorted data?!).
Of course, now the question is, why doesn't -tab- use the same optimizations…
In any case, perhaps this was useful.
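
A minimal sketch of the four approaches compared above, for anyone who
wants to rerun the comparison. This is not the code used for the timings:
the variable names bins and y, the seed, and the small number of
observations are assumptions chosen so the example runs quickly. The
-tabulate- options are the unabbreviated forms of those quoted further
down in the thread.

```stata
* Sketch only: 1 million observations instead of maxlong().
* Variable names (bins, y) and the seed are hypothetical.
clear all
timer clear
set seed 12345
set obs 1000000
generate byte bins = 1 + floor(20 * runiform())   // 20 bins
generate byte y    = 1 + floor(20 * runiform())   // outcome to average

* 1. -tabulate, summarize()-: means only, no frequencies/obs/labels/SDs
timer on 1
tabulate bins, summarize(y) nofreq noobs nolabel nostandard
timer off 1

* 2. -collapse, fast-: replaces the data, so preserve/restore around it
preserve
timer on 2
collapse (mean) y, by(bins) fast
timer off 2
restore

* 3. -forvalues- loop with -if- conditions: one full pass per bin
timer on 3
forvalues b = 1/20 {
    summarize y if bins == `b', meanonly
}
timer off 3

* 4. -bysort- prefix: the timing includes the sort on bins
timer on 4
bysort bins: summarize y, meanonly
timer off 4

* Report elapsed seconds for timers 1-4
timer list
```

Timings at this size will not necessarily rank the methods the same way
as at maxlong() observations, so the number in -set obs- would need to be
scaled up to reproduce anything like the figures above.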
Laszlo
On Tue, Aug 20, 2013 at 4:19 PM, László Sándor <[email protected]> wrote:
> So, I reran the test on 8 cores, with Stata/MP 13, with 32 GB RAM.
>
> I made the following changes:
> 1. I maxed out the number of observations. (see -h limits- and -h maxlong-)
> 2. Made ten byte variables taking 20 integer values, this takes up 25
> GB out of the 32, close to the StataCorp recommendations of leaving
> 50% extra. But I did not check whether virtual memory is touched; maybe I
> can scale the dataset down a bit.
> 3. So I am taking 20 bins now, in case -tabulate, sum- and loops of
> -sum if, meanonly- scale differently.
> 4. I take only oneway tabs, as that's what I need; testing twoway was a mistake.
> 5. I also try a -bys bins:- "looping".
> +1. I mentioned I corrected Eric's code about not looping over all
> values that were "tabbed over". Now the two are comparable.
>
> In this setup,
> -- -tabulate, sum nof noobs nol nost- completes in only 1516.36
> seconds, or ~25 minutes.
> -- the simple frequency tab takes only 583.51 s, but again, this is
> not in the run.
> -- -collapse, fast- took 4025.64 seconds, much slower than -tab, sum-,
> very strange. (I am pretty sure I have exclusive use of this compute
> node, no other process is running or scheduling me).
> -- the if-loops took 3967s, shockingly comparable to -collapse, fast-,
> but still much slower than (now oneway) -tab, sum-.
> -- -bys bins: sum, meanonly- took 3205 s.
>
> So -tab, sum- is unbeatable on big data for oneway tabs with a
> moderate number of bins. Or others can run other tests.
>
> So I stick to parsing the log of -tab, sum-.
>
> Thanks for all your thoughts,
>
> Laszlo
>
> On Tue, Aug 20, 2013 at 5:08 AM, László Sándor <[email protected]> wrote:
>> Thanks, Maarten.
>>
>> My understanding of byable commands was that they loop over -if-
>> conditions anyway, though -in- conditions are supposed to be less
>> wasteful and would explain why the prefix requires sorted data.
>>
>> Trust me, this code is heavily used on big data; if each run can save
>> us minutes, it is still worth it. And my current tests, which max out
>> the code in this thread with -maxlong()- observations (the limit) and
>> thus 20 GB of data, give a 20-minute lead to -collapse- over
>> -tab, sum-. However, the key comparison is with the loops here,
>> and I did not catch that the test was biased in their favor as they
>> did not loop over all observations. I am rerunning those tests now.
>>
>> On Tue, Aug 20, 2013 at 4:21 AM, Maarten Buis <[email protected]> wrote:
>>> On Mon, Aug 19, 2013 at 7:30 PM, László Sándor wrote:
>>>> The other option seemed to be to try to keep track of the levels of
>>>> "bins", and just forval loop over the values, if-ing in a bin at a
>>>> time to quickly grab the means. This was surprisingly fast, and does
>>>> not seem to be any slower without a sort beforehand. Again, I am not
>>>> sure any efficiency of -bys- looping over ifs is worth the cost of
>>>> the initial sorting.
>>>
>>> I think you are mixing up advice here: -by: <something>- is likely to
>>> be faster than a -forvalues- loop combined with -if- conditions. I
>>> don't think anyone suggested that you sort before that loop. The logic
>>> is that an -if- condition will each time by necessity have to go
>>> through all observations. The alternative would be a single sort with
>>> -in- conditions, which I guess is what is at the core of the speed of
>>> the -by- prefix. Depending on how many times you want to use -if-
>>> conditions, there will be a point where the combination of a single
>>> -sort- and many -in- conditions will be quicker than many -if-
>>> conditions. But I don't expect that -sort-ing will help if you choose
>>> the -forvalues- loop combined with -if- conditions.
>>>
>>> On a pragmatic level: how much time have you now spent trying to write
>>> this code, and how much time do you expect to save with that? Are you
>>> sure that you don't end up with a nett loss of time?
>>>
>>> -- Maarten
>>>
>>> ---------------------------------
>>> Maarten L. Buis
>>> WZB
>>> Reichpietschufer 50
>>> 10785 Berlin
>>> Germany
>>>
>>> http://www.maartenbuis.nl
>>> ---------------------------------
>>>
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>> * http://www.ats.ucla.edu/stat/stata/