Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | László Sándor <sandorl@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: where is StataCorp C code located? all in a single executable as compiled binary? |
Date | Tue, 20 Aug 2013 16:19:18 -0400 |
So, I reran the test on 8 cores, with Stata/MP 13 and 32 GB of RAM. I made the following changes:

1. I maxed out the number of observations (see -h limits- and -h maxlong-).
2. I made ten byte variables taking 20 integer values. This takes up 25 GB out of the 32, close to the StataCorp recommendation of leaving 50% extra. But I did not check whether virtual memory is touched; maybe I can scale the dataset down a bit.
3. So I am taking 20 bins now, in case -tabulate, sum- and loops of -sum if, meanonly- scale differently.
4. I take only oneway tabs, as that's what I need; testing twoway was a mistake.
5. I also try a -bys bins:- "looping".
+1. As I mentioned, I corrected Eric's code so that it loops over all values that were "tabbed over". Now the two are comparable.

In this setup,
-- -tabulate, sum nof noobs nol nost- completes in only 1516.36 seconds, or ~25 minutes.
-- the simple frequency tab takes only 583.51 s, but again, this is not in the run.
-- -collapse, fast- took 4025.64 seconds, much slower than -tab, sum-, which is very strange. (I am pretty sure I have exclusive use of this compute node; no other process is running or being scheduled alongside me.)
-- the if-loops took 3967 s, shockingly comparable to -collapse, fast-, but still much slower than (now oneway) -tab, sum-.
-- -bys bins: sum, meanonly- took 3205 s.

So -tab, sum- is unbeatable on big data for oneway tabs with a moderate number of bins. Or others can run other tests. So I stick to parsing the log of -tab, sum-.

Thanks for all your thoughts,

Laszlo

On Tue, Aug 20, 2013 at 5:08 AM, László Sándor <sandorl@gmail.com> wrote:
> Thanks, Maarten.
>
> My understanding of byable commands was that they loop over -if-
> conditions anyway, though -in- conditions are supposed to be less
> wasteful and would explain why the prefix requires sorted data.
>
> Trust me, this code is heavily used on big data, if each run can save
> us minutes, it is still worth it.
> And my current tests with maxing out
> the code in this thread with -maxlong()- number of observations (the
> limit) and thus 20 GB of data give a 20-minute lead to -collapse-
> over -tab, sum-. However, the key comparison is with the loops here,
> and I did not catch that the test was biased in their favor as they
> did not loop over all observations. I am rerunning those tests now.
>
> On Tue, Aug 20, 2013 at 4:21 AM, Maarten Buis <maartenlbuis@gmail.com> wrote:
>> On Mon, Aug 19, 2013 at 7:30 PM, László Sándor wrote:
>>> The other option seemed to be to try to keep track of the levels of
>>> "bins", and just forval loop over the values, if-ing in a bin at a
>>> time to quickly grab the means. This was surprisingly fast, and does
>>> not seem to be any slower without a sort beforehand. Again, I am not
>>> sure any efficiency of -bys- looping of ifs is worth
>>> the cost of the initial sorting.
>>
>> I think you are mixing up advice here: -by: <something>- is likely to
>> be faster than a -forvalues- loop combined with -if- conditions. I
>> don't think anyone suggested that you sort before that loop. The logic
>> is that an -if- condition will each time by necessity have to go
>> through all observations. The alternative would be a single sort with
>> -in- conditions, which I guess is what is at the core of the speed of
>> the -by- prefix. Depending on how many times you want to use -if-
>> conditions, there will be a point where the combination of a single
>> -sort- and many -in- conditions will be quicker than many -if-
>> conditions. But I don't expect that -sort-ing will help if you choose
>> the -forvalues- loop combined with -if- conditions.
>>
>> On a pragmatic level: how much time have you now spent trying to write
>> this code, and how much time do you expect to save with that? Are you
>> sure that you don't end up with a nett loss of time?
>>
>> -- Maarten
>>
>> ---------------------------------
>> Maarten L. Buis
>> WZB
>> Reichpietschufer 50
>> 10785 Berlin
>> Germany
>>
>> http://www.maartenbuis.nl
>> ---------------------------------
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
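
The four approaches compared in this thread can be sketched at small scale as follows. This is a minimal do-file under stated assumptions, not the code that was actually run: the variable names (bins, y), the seed, and the one-million-observation size are all hypothetical, chosen so that it finishes quickly. Timings from it say nothing about the -maxlong()-sized runs reported in the thread.

```stata
* Small-scale sketch of the comparison; all names and sizes are assumptions.
clear
set seed 12345
set obs 1000000
generate byte bins = 1 + floor(20 * runiform())   // 20 integer bins, 1..20
generate byte y    = 1 + floor(20 * runiform())   // variable to average

timer clear

* (1) one-way -tabulate, summarize()-, reporting means only
timer on 1
tabulate bins, summarize(y) nofreq noobs nolabel nostandard
timer off 1

* (2) -forvalues- loop with -if-: every pass scans all observations
timer on 2
forvalues b = 1/20 {
    summarize y if bins == `b', meanonly   // mean is left in r(mean)
}
timer off 2

* (3) a single -sort-, then the -by- prefix (sort time is included)
timer on 3
sort bins
by bins: summarize y, meanonly
timer off 3

* (4) -collapse, fast-; run last because it replaces the data in memory
timer on 4
collapse (mean) y, by(bins) fast
timer off 4

timer list
```

-timer list- then reports the elapsed seconds for timers 1 to 4. Approach (4) is placed last because -collapse- with the fast option does not preserve the original data, so nothing after it can reuse the full dataset.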