Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
Date: Tue, 3 Apr 2012 17:09:19 +0100

I've remembered or reconstructed the logic of 1997. The correction
factor made sense for -binsm- but should be omitted here.
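
A minimal sketch of the binning step with the correction factor dropped, run on the auto data used in the quoted examples. The variable name xbin and the width of 100 are illustrative assumptions, not part of a revised -binmean-:

```stata
* Sketch only: bin weight by rounding, with no "- 0.5 * width" shift
sysuse auto, clear
local width 100

* round(x, 0) returns x unchanged, so a width of 0 would still mean
* "use the distinct values of x as the bins"
generate xbin = round(weight, `width')

* each bin is now labelled by the multiple of the width nearest to x
tabulate xbin
```

With the shift omitted, each bin is labelled by the value that observations were rounded to, rather than by a point half a width below it.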
On Tue, Apr 3, 2012 at 5:05 PM, Nick Cox <[email protected]> wrote:
> I don't know what that - 0.5 * width term is doing there. Some ancient
> illogic, I guess.
>
> On Tue, Apr 3, 2012 at 4:58 PM, Nick Cox <[email protected]> wrote:
>> I had a hack at my -binsm- (STB, SJ) to get a slightly more modern
>> flavour. Code follows examples.
>>
>> At present you can have any bins you like so long as they are defined
>> by round(xvar, #) in the first instance. But if # is 0, the distinct
>> values of the x variable are used, so you can use a previously-defined
>> binning variable.
>>
>> . sysuse auto, clear
>> (1978 Automobile Data)
>>
>> . binmean mpg weight
>>
>> . binmean mpg weight, width(100)
>>
>> . binmean mpg weight, width(100) by(foreign, compact)
>>
>> . binmean mpg weight, width(100) recast(bar) barw(100) base(0)
>>
>> . binmean mpg weight, width(100) recast(connected)
>>
>> . binmean turn trunk weight, width(100) recast(connected)
>>
>> *! 1.0.0 NJC 3 April 2012
>> program binmean
>>     version 8.2
>>     syntax varlist(min=2 numeric) [if] [in] ///
>>     [ , Width(numlist min=1) BY(str) PLOT(str asis) ///
>>     ADDPLOT(str asis) * ]
>>
>>     quietly {
>>         if "`width'" == "" local width 0
>>         local text "bin width `width'"
>>
>>         if `"`by'"' != "" {
>>             gettoken byvar byrest : by, parse(,)
>>             gettoken comma byrest : byrest, parse(,)
>>             local byby `"by(`byvar', note(`text') `byrest')"'
>>         }
>>         else local byby "note(`text')"
>>
>>         marksample touse
>>         if "`byvar'" != "" markout `touse' `byvar', strok
>>         count if `touse'
>>         if r(N) == 0 error 2000
>>
>>         preserve
>>         keep if `touse'
>>         keep `varlist' `byvar'
>>         local nv : word count `varlist'
>>         local x : word `nv' of `varlist'
>>         local Y : list varlist - x
>>
>>         tempvar xbin work
>>         clonevar `xbin' = `x'
>>         replace `xbin' = round(`x', `width') - 0.5 * `width'
>>
>>         foreach y of local Y {
>>             tempvar ymean
>>             clonevar `ymean' = `y'
>>             bysort `xbin' `byvar' : replace `ymean' = sum(`y') / _N
>>             local yshow `yshow' `ymean'
>>         }
>>
>>         bysort `xbin' `byvar': keep if _n == _N
>>     }
>>
>>     scatter `yshow' `xbin', ///
>>     `byby' `options' || ///
>>     || `plot' ///
>>     || `addplot'
>> end
>>
>>
>> 2012/4/3 László Sándor <[email protected]>:
>>> Thanks, Nick, this is very helpful.
>>>
>>> -binsm- does something different, but I'll have a look and see what I
>>> could adapt from its source.
>>>
>>> -twoway__histogram_gen- is about frequencies still, but something like
>>> this is a great idea. Actually, if I could find a routine like this
>>> for bar or line graphs, it would probably do what I need (and then I would
>>> be really surprised if that were still slower than -tab, sum()-).
>>>
>>> Sadly, there is no twoway__line_gen or twoway__bar_gen, and other
>>> searches did not help.
>>>
>>> But this was very educational, thanks again!
>>>
>>> Laszlo
>>>
>>> On Tue, Apr 3, 2012 at 5:01 AM, Nick Cox <[email protected]> wrote:
>>>>
>>>> Overnight I remembered -binsm-
>>>>
>>>> SJ-6-1 gr26_1 . . . . . . . . . . . . . . . . . . Software update for binsm
>>>> (help binsm if installed) . . . . . . . . . . . . . . . . . N. J. Cox
>>>> Q1/06 SJ 6(1):151
>>>> rewritten to support modern Stata graphics
>>>>
>>>> STB-37 gr26 . . . . . . . . . . . Bin smoothing and summary on scatter plots
>>>> (help binsm if installed) . . . . . . . . . . . . . . . . . N. J. Cox
>>>> 5/97 pp.9--12; STB Reprints Vol 7, pp.59--63
>>>> alternative to graph, twoway bands(); produces a scatterplot
>>>> of yvar against xvar with one or more summaries of yvar for bins
>>>> of xvar
>>>>
>>>> and -twoway__histogram_gen-
>>>>
>>>> SJ-5-2 gr0014 . . . . . . . Stata tip 20: Generating histogram bin variables
>>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. A. Harrison
>>>> Q2/05 SJ 5(2):280--281 (no commands)
>>>> tip illustrating the use of twoway__histogram_gen for
>>>> creation of complex histograms and other graphs or tables
>>>>
>>>> My strategic advice is this. You want a reduced dataset for graphing,
>>>> so -drop- aggressively. Once you have identified observations "to
>>>> use", go
>>>>
>>>> keep if `touse'
>>>> drop `touse'
>>>>
>>>> Once the mean is in the last observation of every block of
>>>> observations, -drop- all the others.
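
A minimal sketch of the advice above, assuming the auto data and a bin width of 500; the variable names touse, xbin, and ymean are illustrative and stand in for the temporary variables a program would use:

```stata
* Sketch only: bin means of mpg by rounded weight, one row kept per bin
sysuse auto, clear

* identify the observations to use, keep them, then drop the marker
generate byte touse = !missing(mpg, weight)
keep if touse
drop touse

generate xbin = round(weight, 500)

* sum() is a running sum, so within each bin the last observation
* holds the bin total; dividing by _N there gives the bin mean
bysort xbin : generate ymean = sum(mpg) / _N

* keep only the last observation of every block
by xbin : keep if _n == _N

list xbin ymean
```

The reduced dataset has one observation per bin, which is all that a graph of bin means needs.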
>>>>
>>>>
>>>> 2012/4/3 László Sándor <[email protected]>:
>>>> > Thanks for this, Nick.
>>>> >
>>>> > I found the (many and embarrassing) mistakes in my code; below is a
>>>> > neater version that also actually does what it should, or so it seems.
>>>> >
>>>> > That said, it is still rarely faster than logging -tab, sum()-, though
>>>> > with many millions of observations, running on many (>4) cores, it at
>>>> > least has a little advantage. (But both beat my bare-bones Mata
>>>> > attempts.)
>>>> >
>>>> > I would still be a bit curious how secret the secret sauce of
>>>> > StataCorp is for this, as this "collapsing" is pretty commonplace for
>>>> > many descriptives (also bar graphs, line graphs etc.), and while they
>>>> > are rightly proud if they could tweak -tabulate- to run this fast,
>>>> > they perhaps could let us (and themselves) work towards making other
>>>> > similar code run faster too. Though, of course, there must be a
>>>> > reason (general purpose etc.) why this is harder elsewhere.
>>>> >
>>>> > Thanks again,
>>>> >
>>>> > Laszlo
>>>> >
>>>> > tempvar wsum tag
>>>> >
>>>> > if ("`y2_var'"!="") local y2 y2
>>>> > else local y2 ""
>>>> >
>>>> > sort `x_q' `touse'
>>>> > by `x_q' `touse': g byte `tag' = _n == _N
>>>> > if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
>>>> > else by `x_q' `touse': g `wsum' = _N
>>>> >
>>>> > foreach v in x y `y2' {
>>>> >     if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
>>>> >     else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
>>>> >
>>>> >     quietly replace ``v'_mean' = cond(`tag' & `touse',``v'_mean'/`wsum',.)
>>>> > }
>>>> >
>>>> > On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <[email protected]> wrote:
>>>> >>
>>>> >> I will look at it tomorrow.
>>>> >>
>>>> >> 2012/4/2 László Sándor <[email protected]>:
>>>> >> > Nick,
>>>> >> >
>>>> >> > thanks, I did follow up on your post. Sadly, I could not easily get
>>>> >> > -by- working, or to be precise, could not use the variables that it
>>>> >> > generated. Below is an attempt; if I may take the liberty of asking
>>>> >> > you to parse it, I would be grateful for comments to get it
>>>> >> > working. The indexing must be off. It tries to average two (x_r and
>>>> >> > y_r) or three (y2_r extra) variables. It generates values that are
>>>> >> > too large for some bins (i.e. from U[0,1] variables some averages
>>>> >> > become larger than 20).
>>>> >> >
>>>> >> > I am happy if someone from StataCorp follows up too! :)
>>>> >> >
>>>> >> > Thanks,
>>>> >> >
>>>> >> > László
>>>> >> >
>>>> >> > tempvar wsum tag ones
>>>> >> > g byte `ones' = 1
>>>> >> >
>>>> >> >
>>>> >> > if ("`y2_var'"!="") local y2 y2
>>>> >> > else local y2 ""
>>>> >> >
>>>> >> >
>>>> >> > if ("`weight1'"!="") g `wsum' = sum(`weight1') if `touse'
>>>> >> > else g `wsum' = sum(`ones') if `touse'
>>>> >> >
>>>> >> >
>>>> >> > sort `x_q'
>>>> >> > by `x_q': g byte `tag' = _N if `touse'
>>>> >> >
>>>> >> > foreach v in x y `y2' {
>>>> >> >     if "`weight1'"!=""{
>>>> >> >         by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1') if `touse'
>>>> >> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>>> >> >     }
>>>> >> >
>>>> >> >     else {
>>>> >> >         by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
>>>> >> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>>> >> >     }
>>>> >> > }
>>>> >> >
>>>> >> >
>>>> >> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <[email protected]> wrote:
>>>> >> >>
>>>> >> >> We are back to the questions you asked a week ago. Mostly this is for
>>>> >> >> StataCorp. Otherwise please see again my answers at
>>>> >> >>
>>>> >> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
>>>> >> >>
>>>> >> >> I've had dramatic speed-ups with Mata -- my record is reducing
>>>> >> >> execution time from 5 days to 2 minutes, but that was partly because
>>>> >> >> my original code was so dumb -- but I've not tried anything like the
>>>> >> >> stuff you were using.
>>>> >> >>
>>>> >> >> -tabulate, summarize- is compiled C code. I think the nearest you can
>>>> >> >> get is by using -by:- as explained in the post just quoted.
>>>> >> >>
>>>> >> >> Nick
>>>> >> >>
>>>> >> >> 2012/4/2 László Sándor <[email protected]>:
>>>> >> >> > Hi all,
>>>> >> >> >
>>>> >> >> > I had several questions recently on this list about compiling Mata
>>>> >> >> > code. I still could not generate the compile-time locals
>>>> >> >> > with loops, but I typed them out and compiled. Now I have done my
>>>> >> >> > test runs, and the results are surprising. Let me ask why:
>>>> >> >> >
>>>> >> >> > My basic problem was to do a fast "collapse" to make binned scatter
>>>> >> >> > plots. Collapse was unacceptably slow, probably because of the
>>>> >> >> > necessary preserve-restore cycles, or inefficient coding of collapse
>>>> >> >> > (for its general purpose).
>>>> >> >> >
>>>> >> >> > I already had a version that parsed a log of -tabulate, summarize-.
>>>> >> >> > Yes, it is as much of a hack as it sounds. I was not expecting
>>>> >> >> > this to be fast, not least because of the file I/O and the parsing.
>>>> >> >> >
>>>> >> >> > Now I built a Mata function that "collapses" into new variables while
>>>> >> >> > leaving the data otherwise intact. For this I used Ben Jann's
>>>> >> >> > -mf_mm_collapse-, and compiled all the necessary functions myself in
>>>> >> >> > the ado file.
>>>> >> >> >
>>>> >> >> > And the test run with 100 million observations told me it was slower
>>>> >> >> > than the hack. Before I give up and declare the hack unbeatable, I have
>>>> >> >> > one suspicion. I had the test run on Stata 12 MP on a cluster, with
>>>> >> >> > 12 cores. Perhaps -tabulate- used all of them, and my code did not.
>>>> >> >> >
>>>> >> >> > Are there guidelines on how to speed up Mata in this situation (if it
>>>> >> >> > is not MP-aware to begin with)?
>>>> >> >> >
>>>> >> >> > Or, tentatively, can I ask for some guidance about the magic of
>>>> >> >> > -tabulate, summarize-? Is that magic accessible/reproducible without
>>>> >> >> > just logging its output?
>>>> >> >> >
>>>>