Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Mon, 19 Aug 2013 11:39:39 -0400
Thanks, Nick.
I would much prefer not to have a new dataset. I want to plot (and
maybe log) the binned means, and to use the immediate -scatteri-, I
need no new variables but probably a few macros with the means stored
in them.
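A minimal sketch of that idea, under stated assumptions: the shipped auto data stands in for the real dataset, rep78 plays the bin variable with known values 1 to 5, and the local name points is made up for illustration. It loops -summarize, meanonly- over the bins, accumulates the means in a local macro, and hands the pairs to -scatteri-, so no variable is created and the data stay in memory.

```stata
* Sketch only: the auto data and rep78 as the bin variable are stand-ins
sysuse auto, clear
local points
forvalues b = 1/5 {
    * one pass over the data per bin; meanonly computes little beyond the mean
    summarize price if rep78 == `b', meanonly
    * scatteri expects its immediate values as y x pairs
    local points `points' `r(mean)' `b'
}
twoway scatteri `points', ytitle("Mean price") xtitle("Repair record (bin)")
```

The same local could be written to a log with -display- if the means need to be kept.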
I am not sure Mata would be faster than the loops over the otherwise
optimised -sum, meanonly- that we tested; those were surprisingly fast.
No, I haven't tried either.
-_gmean.ado- does start with a sort, which is prohibitive when big
data is the primary use case.
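For reference, the -by:- idiom at issue looks roughly like the sketch below. This is an illustration of the general running-sum technique, not a copy of the ado-file; the auto data and the variable name mean_price are assumptions. The -sort- on the first line is the step that becomes expensive on very large data.

```stata
* Sketch of the by-group running-sum idiom; not the actual ado-file contents
sysuse auto, clear
sort rep78
* running sum of price divided by running count of nonmissing prices
by rep78: generate double mean_price = sum(price) / sum(price < .)
* the last observation in each group holds the group mean; copy it to all
by rep78: replace mean_price = mean_price[_N]
```

This fills a new variable in place rather than reducing the dataset, but it still needs the data sorted by the grouping variable.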
On Mon, Aug 19, 2013 at 11:32 AM, Nick Cox <[email protected]> wrote:
> I got lost somewhere early on in this thread, but I don't see mention
> towards the end of some other possibilities.
>
> If I understand it correctly, László wants a fast implementation
> equivalent to -tabulate, summarize- in place, i.e. without replacing
> the dataset by a reduced dataset, but with results saved to the new
> dataset.
>
> As earlier emphasised, he can't borrow or steal code from -tabulate,
> summarize- because that is compiled C code invisible to anyone except
> Stata developers (and they aren't about to show you (and you almost
> certainly wouldn't benefit from the code any way as it probably
> depends on lots of other C code (if it's typical of C code of this
> kind (or at least that's my guess)))).
>
> To that end, my thoughts are
>
> 1. Never use reduction commands if you don't want data reduction,
> unless you can be sure that the overhead of reading the data in again
> can be ignored and you can't think of a better method.
>
> 2. The possibility of using Mata was not mentioned (but no, I don't
> have code up my sleeve).
>
> 3. Although -egen- itself is slow, the code at the heart of _gmean.ado
> and _gsd.ado is where I would start. That uses -by:- and avoids a loop
> and although there are a few lines to be interpreted I would expect it
> to be pretty fast.
>
> Nick
> [email protected]
>
>
> On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
>> Thanks, all.
>>
>> I am still confused about how I could get the speed of some of the
>> methods like -collapse- without losing my data, which usually takes
>> dozens of GBs.
>>
>> Otherwise I think we are only talking about -tabulate- versus -table-
>> but both need log-parsing, or some -bys: summarize- and collecting
>> locals, which I did not attempt.
>>
>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>> of the many lines of -summarize, meanonly-, especially as each one runs
>> over the whole dataset again, just with -if- selecting different
>> observations.
>>
>> On an 8-core StataMP 13 for Linux,
>> full -tabulate, sum- itself took ~140s
>> -tab, matcell- took <5s, but indeed generates frequencies only.
>> a second -tabulate, sum-, even with nof and nost, took the same time,
>> also with the caplog wrapper
>> a -collapse, fast- took 36s, but of course this loses the data
>> the -summarize- loops took 92s without the postfiles, 34.53s with, but I
>> still cannot -scatteri- the results in the same Stata instance…
>>
>> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>> full -tabulate, sum- itself took ~195s
>> -tab, matcell-: 8s
>> again same speed without frequencies and standard deviations, or with
>> the wrapper, for -tab, sum-
>> -collapse- took 60s
>> the loops of -summarize- took 160s now without the postfiles, 47s with.
>>
>> Thanks!
>>
>> Laszlo
>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/faqs/resources/statalist-faq/
> * http://www.ats.ucla.edu/stat/stata/