Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Tue, 20 Aug 2013 16:19:18 -0400
So, I reran the test on 8 cores, with Stata/MP 13, with 32 GB RAM.
I made the following changes:
1. I maxed out the number of observations. (see -h limits- and -h maxlong-)
2. I made ten byte variables taking 20 integer values. This takes up 25
GB out of the 32, close to the StataCorp recommendation of leaving
50% extra. But I did not check whether virtual memory is touched; maybe I
can scale the dataset down a bit.
3. So I am taking 20 bins now, in case -tabulate, sum- and loops of
-sum if, meanonly- scale differently.
4. I take only oneway tabs, as that's what I need. Testing twoway was a mistake.
5. I also try a -bys bins:- "looping".
+1. As I mentioned, I corrected Eric's code, which did not loop over all
the values that were "tabbed over". Now the two are comparable.
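For anyone who wants to try this at a smaller scale, here is a minimal sketch of how such a test dataset could be built. It is not the code from my run: the variable names (x1-x10, bins), the seed, and the one million observations are placeholders I picked for illustration, and it should be run as a do-file.

```stata
* Scaled-down sketch of the test data; N, the seed, and all names are assumptions
clear
set seed 12345                      // arbitrary, only for reproducibility
local N = 1000000                   // placeholder; raise only as far as RAM allows
set obs `N'

* ten byte variables, each taking the integer values 1..20
forvalues j = 1/10 {
    generate byte x`j' = 1 + floor(20 * runiform())
}
rename x1 bins                      // hypothetical name for the grouping variable

describe, short                     // reports observations, variables, and size
```

Storing the variables as byte keeps each observation at ten bytes, which is what makes it possible to push the number of observations toward the limit at all.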
In this setup,
-- -tabulate, sum nof noobs nol nost- completes in only 1516.36
seconds, or ~25 minutes.
-- the simple frequency tab takes only 583.51 s, but again, this is
not in the run.
-- -collapse, fast- took 4025.64 seconds, roughly 2.7 times as long as
-tab, sum-, which is very strange. (I am pretty sure I have exclusive use
of this compute node, and no other process is running on it.)
-- the if-loops took 3967 s, shockingly comparable to -collapse, fast-,
but still much slower than (now oneway) -tab, sum-.
-- -bys bins: sum, meanonly- took 3205 s.
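The four timed approaches can be sketched roughly as follows, using -timer-. This is an illustration under assumptions, not the exact do-file behind the numbers above: the variable names bins and x2 and the small dataset are made up, and -preserve- around -collapse- adds its own cost at full scale. Run it as a do-file.

```stata
* Small stand-in dataset; names and size are assumptions for illustration
clear
set obs 1000000
generate byte bins = 1 + floor(20 * runiform())
generate byte x2   = 1 + floor(20 * runiform())

timer clear

* 1. oneway table of means
timer on 1
tabulate bins, summarize(x2) nofreq noobs nolabel nostandard
timer off 1

* 2. collapse to bin means (preserve/restore kept outside the timer)
preserve
timer on 2
collapse (mean) x2, by(bins) fast
timer off 2
restore

* 3. loop over the bin values with -if-
timer on 3
forvalues b = 1/20 {
    summarize x2 if bins == `b', meanonly
}
timer off 3

* 4. by-group processing; -bysort- sorts the data first, so it goes last
timer on 4
bysort bins: summarize x2, meanonly
timer off 4

timer list
```

On a dataset this small the ranking need not match what I saw at the observation limit, so treat any timings from the sketch as a sanity check only.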
So in these tests -tab, sum- is unbeatable on big data for oneway tabs
with a moderate number of bins, though others are welcome to run other tests.
So I stick to parsing the log of -tab, sum-.
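One other test someone could run is the single -sort- followed by -in- ranges that Maarten describes in the quoted message below. I have not timed this; the sketch only shows the mechanics. The names bins, x2, and grpsize are hypothetical, and the extra long variable costs memory that may not be available at the observation limit. Run it as a do-file.

```stata
* Small stand-in dataset; names and size are assumptions for illustration
clear
set obs 1000000
generate byte bins = 1 + floor(20 * runiform())
generate byte x2   = 1 + floor(20 * runiform())

sort bins                                   // one sort up front
by bins: generate long grpsize = _N         // size of each bin, stored per observation

* walk through the sorted data one contiguous block at a time
local start = 1
while `start' <= _N {
    local stop = `start' + grpsize[`start'] - 1
    summarize x2 in `start'/`stop', meanonly    // touches only this block
    display "bin " bins[`start'] ": mean = " r(mean)
    local start = `stop' + 1
}
```

Each -in- range reads only the observations of one bin, whereas each -if- condition has to scan all observations; whether that repays the cost of the sort is exactly what would need testing.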
Thanks for all your thoughts,
Laszlo
On Tue, Aug 20, 2013 at 5:08 AM, László Sándor <[email protected]> wrote:
> Thanks, Maarten.
>
> My understanding of byable commands was that they loop over -if-
> conditions anyway, though -in- conditions are supposed to be less
> wasteful and would explain why the prefix requires sorted data.
>
> Trust me, this code is heavily used on big data. If each run can save
> us minutes, it is still worth it. My current tests, which max out
> the code in this thread at the -maxlong()- number of observations (the
> limit) and thus 20 GB of data, give a 20-minute lead to -collapse-
> over -tab, sum-. However, the key comparison here is with the loops,
> and I did not catch that the test was biased in their favor, as they
> did not loop over all observations. I am rerunning those tests now.
>
> On Tue, Aug 20, 2013 at 4:21 AM, Maarten Buis <[email protected]> wrote:
>> On Mon, Aug 19, 2013 at 7:30 PM, László Sándor wrote:
>>> The other option seemed to be to try to keep track of the levels of
>>> "bins", and just forval loop over the values, if-ing in a bin at a
>>> time to quickly grab the means. This was surprisingly fast, and does
>>> not seem to be any slower without a sort beforehand. Again, I am not
>>> sure any efficiency of -bys- looping of ifs is worth
>>> the cost of the initial sorting.
>>
>> I think you are mixing up advice here: -by: <something>- is likely to
>> be faster than a -forvalues- loop combined with -if- conditions. I
>> don't think anyone suggested that you sort before that loop. The logic
>> is that an -if- condition will each time by necessity have to go
>> through all observations. The alternative would be a single sort with
>> -in- conditions, which I guess is what is at the core of the speed of
>> the -by- prefix. Depending on how many times you want to use -if-
>> conditions, there will be a point where the combination of a single
>> -sort- and many -in- conditions will be quicker than many -if-
>> conditions. But I don't expect that -sort-ing will help if you choose
>> the -forvalues- loop combined with -if- conditions.
>>
>> On a pragmatic level: how much time have you now spent trying to write
>> this code, and how much time do you expect to save with that? Are you
>> sure that you don't end up with a nett loss of time?
>>
>> -- Maarten
>>
>> ---------------------------------
>> Maarten L. Buis
>> WZB
>> Reichpietschufer 50
>> 10785 Berlin
>> Germany
>>
>> http://www.maartenbuis.nl
>> ---------------------------------
>>
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/