Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: where is StataCorp C code located? all in a single executable as compiled binary? |
Date | Mon, 19 Aug 2013 16:32:06 +0100 |
I got lost somewhere early on in this thread, but I don't see mention
towards the end of some other possibilities.

If I understand it correctly, László wants a fast implementation
equivalent to -tabulate, summarize- in place, i.e. without replacing
the dataset by a reduced dataset, but with results saved to the new
dataset. As earlier emphasised, he can't borrow or steal code from
-tabulate, summarize- because that is compiled C code invisible to
anyone except Stata developers (and they aren't about to show you (and
you almost certainly wouldn't benefit from the code anyway as it
probably depends on lots of other C code (if it's typical of C code of
this kind (or at least that's my guess)))).

To that end, my thoughts are

1. Never use reduction commands if you don't want data reduction,
unless you can be sure that the overhead of reading the data in again
can be ignored and you can't think of a better method.

2. The possibility of using Mata was not mentioned (but no, I don't
have code up my sleeve).

3. Although -egen- itself is slow, the code at the heart of _gmean.ado
and _gsd.ado is where I would start. That uses -by:- and avoids a loop
and although there are a few lines to be interpreted I would expect it
to be pretty fast.

Nick
njcoxstata@gmail.com

On 19 August 2013 15:45, László Sándor <sandorl@gmail.com> wrote:
> Thanks, all.
>
> I am still confused how I could combine the speed of some of the
> methods like -collapse- without losing my data, which usually takes
> dozens of GBs.
>
> Otherwise I think we are only talking about -tabulate- versus -table-
> but both need log-parsing, or some -bys: summarize- and collecting
> locals, which I did not attempt.
>
> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
> of the many lines of -summarize, meanonly-, esp. as it runs over the
> dataset many times just ifs in different observations.
>
> On an 8-core StataMP 13 for Linux,
> full -tabulate, sum- itself took ~140s
> -tab, matcell- took <5s, but indeed generates frequencies only.
> a second -tabulate, sum-, even with nof and nost, took the same
> also with the caplog wrapper
> a -collapse, fast- took 36s, but of course this loses the data
> the -summarize- took 92s without the postfiles, 34.53s with, but I
> still cannot scatteri the results in the same Stata instance…
>
> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
> full -tabulate, sum- itself took ~195s
> -tab, matcell-: 8s
> again same speed without frequencies and standard deviations, or with
> the wrapper, for -tab, sum-
> -collapse- took 60s
> the loops of -summarize- took 160s now without the postfiles, 47s with.
>
> Thanks!
>
> Laszlo
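
A minimal sketch of the -by:- idea in point 3 of Nick's reply: group means and standard deviations written into new variables with running sums, so the data stay in memory and nothing is reduced. This is not the text of _gmean.ado or _gsd.ado, only the same general technique. The auto dataset, the grouping variable rep78 and the names mean_price and sd_price are assumptions chosen for illustration.

```stata
* Illustration only: auto data, rep78 as the grouping variable (assumed)
sysuse auto, clear

* Running sum of price over running count of nonmissing values;
* sum() treats missing values as zero, and (price < .) is 1 if nonmissing
bysort rep78: generate double mean_price = sum(price) / sum(price < .)

* The last observation in each group holds the group mean; copy it to all
by rep78: replace mean_price = mean_price[_N]

* Same trick for the standard deviation, with an n - 1 divisor
by rep78: generate double sd_price = ///
    sqrt(sum((price - mean_price)^2) / (sum(price < .) - 1))
by rep78: replace sd_price = sd_price[_N]
```

Each statistic costs one sort plus two passes over the data, with no loop over groups and no reading the dataset back in. Whether that beats -collapse- or -tabulate, summarize- on dozens of GBs would need timing on the real data.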