Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Mon, 19 Aug 2013 16:32:06 +0100
I got lost somewhere early on in this thread, but I don't see mention
towards the end of some other possibilities.
If I understand it correctly, László wants a fast implementation
equivalent to -tabulate, summarize- in place, i.e. without replacing
the dataset by a reduced dataset, but with results saved to the new
dataset.
As earlier emphasised, he can't borrow or steal code from -tabulate,
summarize- because that is compiled C code invisible to anyone except
Stata developers (and they aren't about to show you (and you almost
certainly wouldn't benefit from the code anyway as it probably
depends on lots of other C code (if it's typical of C code of this
kind (or at least that's my guess)))).
To that end, my thoughts are
1. Never use reduction commands if you don't want data reduction,
unless you can be sure that the overhead of reading the data in again
can be ignored and you can't think of a better method.
2. The possibility of using Mata was not mentioned (but no, I don't
have code up my sleeve).
3. Although -egen- itself is slow, the code at the heart of _gmean.ado
and _gsd.ado is where I would start. That uses -by:- and avoids a loop
and although there are a few lines to be interpreted I would expect it
to be pretty fast.
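A minimal sketch of the -by:- idea in point 3, for anyone who wants to
try it. It is written from the general technique (running sums within
groups, then copying the last value in each group to every observation),
not copied from _gmean.ado or _gsd.ado. The auto dataset, the grouping
variable foreign and the new variable names are assumptions for
illustration only.

```stata
* Group means and SDs added in place as new variables, no data reduction.
sysuse auto, clear

* sum() is a running sum that treats missings as zero, so within each
* group the last observation holds total / number of nonmissing values.
bysort foreign: generate double mean_price = sum(price) / sum(price < .)
by foreign: replace mean_price = mean_price[_N]

* Same trick for the SD: running sum of squared deviations from the
* group mean, divided by (n - 1), then spread the last value.
by foreign: generate double n_price = sum(price < .)
by foreign: generate double sd_price = sqrt(sum((price - mean_price)^2) / (n_price[_N] - 1))
by foreign: replace sd_price = sd_price[_N]
```

Each statement is a single pass over the data with no explicit loop, and
the original observations stay in memory. Because the results are
constant within groups, something like -tabdisp foreign,
cellvar(mean_price sd_price)- should then show one row per group.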
Nick
[email protected]
On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
> Thanks, all.
>
> I am still confused about how I could combine the speed of some of the
> methods like -collapse- without losing my data, which usually takes
> dozens of GBs.
>
> Otherwise I think we are only talking about -tabulate- versus -table-
> but both need log-parsing, or some -bys: summarize- and collecting
> locals, which I did not attempt.
>
> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
> of the many lines of -summarize, meanonly-, esp. as it runs over the
> dataset many times, just with -if- selecting different observations.
>
> On an 8-core StataMP 13 for Linux,
> full -tabulate, sum- itself took ~140s
> -tab, matcell- took <5s, but indeed generates frequencies only.
> a second -tabulate, sum-, even with nof and nost, took the same
> also with the caplog wrapper
> a -collapse, fast- took 36s, but of course this loses the data
> the -summarize- took 92s without the postfiles, 34.53s with — but I
> still cannot scatteri the results in the same Stata instance…
>
> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
> full -tabulate, sum- itself took ~195s
> -tab, matcell-: 8s
> again same speed without frequencies and standard deviations, or with
> the wrapper, for -tab, sum-
> -collapse- took 60s
> the loops of -summarize- took 160s now without the postfiles, 47s with.
>
> Thanks!
>
> Laszlo
>