Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Mon, 19 Aug 2013 17:01:19 +0100

These reservations don't change my advice.
1. For graphics it is easiest by far to have variables to graph. I use
-scatteri- for special effects, but you'd have to factor in
programming time to get that right.
2. Trying to avoid a -sort- strikes me as false economy here. -by:-
goes hand in hand with a -sort- and the alternative of some kind of
loop and -if- can't compete well, in my experience.
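
A minimal sketch of both points, not a benchmark. It assumes the auto data shipped with Stata stands in for the real dataset; the names bin, ybar, xbar and pick, and the choice of -xtile- with 10 bins, are hypothetical. The first part follows the -by:- idiom at the heart of _gmean.ado and leaves the binned means as variables to graph, keeping the data in place. The second part is the loop-and-macros route with -scatteri- for comparison.

```stata
* Assumption: example data; replace with the real outcome and binning variable
sysuse auto, clear
xtile bin = weight, nquantiles(10)      // hypothetical binning into 10 groups

* -by:- approach: one sort, then running sums within each bin
sort bin
by bin: gen double ybar = sum(price) / sum(price < .)
by bin: replace ybar = ybar[_N]         // last running mean is the bin mean
by bin: gen double xbar = sum(weight) / sum(weight < .)
by bin: replace xbar = xbar[_N]
by bin: gen byte pick = _n == 1         // plot one observation per bin
twoway scatter ybar xbar if pick

* Loop approach: no new variables, means collected in a local macro
levelsof bin, local(bins)
local coords
foreach b of local bins {
    summarize price if bin == `b', meanonly
    local my = r(mean)
    summarize weight if bin == `b', meanonly
    local coords `coords' `my' `r(mean)'
}
twoway scatteri `coords'                // immediate arguments are y x pairs
```

The loop reads the whole dataset twice per bin, which is why it is unlikely to scale as well as a single -sort- followed by -by:-.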
Nick
[email protected]
On 19 August 2013 16:39, László Sándor <[email protected]> wrote:
> Thanks, Nick.
>
> I would much prefer not to have a new dataset. I want to plot (and
> maybe log) the binned means, and to use the immediate -scatteri-, I
> need no new variables but probably a few macros with the means stored
> in them.
>
> I am not sure if Mata would be faster than the looping tests we ran
> with the otherwise optimised -sum, meanonly-; they were surprisingly fast.
> No, I haven't tried either.
>
> -_gmean.ado- does start with a sort, which is prohibitive with big
> data as the primary use case.
>
> On Mon, Aug 19, 2013 at 11:32 AM, Nick Cox <[email protected]> wrote:
>> I got lost somewhere early on in this thread, but I don't see mention
>> towards the end of some other possibilities.
>>
>> If I understand it correctly, László wants a fast implementation
>> equivalent to -tabulate, summarize- in place, i.e. without replacing
>> the dataset by a reduced dataset, but with results saved to the new
>> dataset.
>>
>> As earlier emphasised, he can't borrow or steal code from -tabulate,
>> summarize- because that is compiled C code, invisible to anyone except
>> Stata developers. They aren't about to show it to you, and you almost
>> certainly wouldn't benefit from the code anyway: my guess is that, like
>> most C code of this kind, it depends on lots of other C code.
>>
>> To that end, my thoughts are
>>
>> 1. Never use reduction commands if you don't want data reduction,
>> unless you can be sure that the overhead of reading the data in again
>> can be ignored and you can't think of a better method.
>>
>> 2. The possibility of using Mata was not mentioned (but no, I don't
>> have code up my sleeve).
>>
>> 3. Although -egen- itself is slow, the code at the heart of _gmean.ado
>> and _gsd.ado is where I would start. That uses -by:- and avoids a loop
>> and although there are a few lines to be interpreted I would expect it
>> to be pretty fast.
>>
>> Nick
>> [email protected]
>>
>>
>> On 19 August 2013 15:45, László Sándor <[email protected]> wrote:
>>> Thanks, all.
>>>
>>> I am still confused about how I could get the speed of some of the
>>> methods like -collapse- without losing my data, which usually takes up
>>> dozens of GBs.
>>>
>>> Otherwise I think we are only talking about -tabulate- versus -table-
>>> but both need log-parsing, or some -bys: summarize- and collecting
>>> locals, which I did not attempt.
>>>
>>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>>> of the many lines of -summarize, meanonly-, esp. as they run over the
>>> dataset many times, just with -if- selecting different observations.
>>>
>>> On an 8-core StataMP 13 for Linux,
>>> full -tabulate, sum- itself took ~140s
>>> -tab, matcell- took <5s, but indeed generates frequencies only.
>>> a second -tabulate, sum-, even with nof and nost, took the same time,
>>> also with the caplog wrapper
>>> a -collapse, fast- took 36s, but of course this loses the data
>>> the -summarize- took 92s without the postfiles, 34.53s with, but I
>>> still cannot scatteri the results in the same Stata instance…
>>>
>>> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>>> full -tabulate, sum- itself took ~195s
>>> -tab, matcell-: 8s
>>> again same speed without frequencies and standard deviations, or with
>>> the wrapper, for -tab, sum-
>>> -collapse- took 60s
>>> the loops of -summarize- took 160s now without the postfiles, 47s with.
>>>