Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
Date
Tue, 3 Apr 2012 17:05:11 +0100
I don't know what that - 0.5 * width term is doing there. Some ancient
illogic, I guess.
On Tue, Apr 3, 2012 at 4:58 PM, Nick Cox <[email protected]> wrote:
> I had a hack at my -binsm- (STB, SJ) to get a slightly more modern
> flavour. Code follows examples.
>
> At present you can have any bins you like so long as they are defined
> by round(xvar, #) in the first instance. But if # is 0, the distinct
> values of the x variable are used, so you can use a previously-defined
> binning variable.
>
> . sysuse auto, clear
> (1978 Automobile Data)
>
> . binmean mpg weight
>
> . binmean mpg weight, width(100)
>
> . binmean mpg weight, width(100) by(foreign, compact)
>
> . binmean mpg weight, width(100) recast(bar) barw(100) base(0)
>
> . binmean mpg weight, width(100) recast(connected)
>
> . binmean turn trunk weight, width(100) recast(connected)
>
> *! 1.0.0 NJC 3 April 2012
> program binmean
> version 8.2
> syntax varlist(min=2 numeric) [if] [in] ///
> [ , Width(numlist min=1) BY(str) PLOT(str asis) ///
> ADDPLOT(str asis) * ]
>
> quietly {
> if "`width'" == "" local width 0
> local text "bin width `width'"
>
> if `"`by'"' != "" {
> gettoken byvar byrest : by, parse(,)
> gettoken comma byrest : byrest, parse(,)
> local byby `"by(`byvar', note(`text') `byrest')"'
> }
> else local byby "note(`text')"
>
> marksample touse
> if "`byvar'" != "" markout `touse' `byvar', strok
> count if `touse'
> if r(N) == 0 error 2000
>
> preserve
> keep if `touse'
> keep `varlist' `byvar'
> local nv : word count `varlist'
> local x : word `nv' of `varlist'
> local Y : list varlist - x
>
> tempvar xbin work
> clonevar `xbin' = `x'
> replace `xbin' = round(`x', `width') - 0.5 * `width'
>
> foreach y of local Y {
> tempvar ymean
> clonevar `ymean' = `y'
> bysort `xbin' `byvar' : replace `ymean' = sum(`y') / _N
> local yshow `yshow' `ymean'
> }
>
> bysort `xbin' `byvar': keep if _n == _N
> }
>
> scatter `yshow' `xbin', ///
> `byby' `options' ///
> || `plot' ///
> || `addplot'
> end
>
>
> 2012/4/3 László Sándor <[email protected]>:
>> Thanks, Nick, this is very helpful.
>>
>> -binsm- does something different, but I'll have a look and see what I
>> could adapt from its source.
>>
>> -twoway__histogram_gen- is about frequencies still, but something like
>> this is a great idea. Actually, if I could find a routine like this
>> for bar or line graphs, it would probably do what I need (and then I
>> would be really surprised if it were still slower than -tab, sum()-).
>>
>> Sadly, there is no twoway__line_gen or twoway__bar_gen, and other
>> searches did not help.
>>
>> But this was very educational, thanks again!
>>
>> Laszlo
>>
>> On Tue, Apr 3, 2012 at 5:01 AM, Nick Cox <[email protected]> wrote:
>>>
>>> Overnight I remembered -binsm-
>>>
>>> SJ-6-1 gr26_1 . . . . . . . . . . . . . . . . . . Software update for binsm
>>> (help binsm if installed) . . . . . . . . . . . . . . . . . N. J. Cox
>>> Q1/06 SJ 6(1):151
>>> rewritten to support modern Stata graphics
>>>
>>> STB-37 gr26 . . . . . . . . . . . Bin smoothing and summary on scatter plots
>>> (help binsm if installed) . . . . . . . . . . . . . . . . . N. J. Cox
>>> 5/97 pp.9--12; STB Reprints Vol 7, pp.59--63
>>> alternative to graph, twoway bands(); produces a scatterplot
>>> of yvar against xvar with one or more summaries of yvar for bins
>>> of xvar
>>>
>>> and -twoway__histogram_gen-
>>>
>>> SJ-5-2 gr0014 . . . . . . . Stata tip 20: Generating histogram bin variables
>>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. A. Harrison
>>> Q2/05 SJ 5(2):280--281 (no commands)
>>> tip illustrating the use of twoway__histogram_gen for
>>> creation of complex histograms and other graphs or tables
>>>
>>> My strategic advice is this. You want a reduced dataset for graphing,
>>> so -drop- aggressively. Once you have identified observations "to
>>> use", go
>>>
>>> keep if `touse'
>>> drop `touse'
>>>
>>> Once the mean is in the last observation of every block of
>>> observations, -drop- all the others.
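The advice above can be sketched as follows. This is only an illustration, not code from the thread: the auto dataset, the bin width of 500, and the variable names touse, xbin and ymean are assumptions chosen for the example.

```stata
* Sketch only: dataset, bin width and variable names are illustrative
sysuse auto, clear
preserve

* Identify the observations to use, then drop everything else early
generate byte touse = !missing(mpg, weight)
keep if touse
drop touse
keep mpg weight

* Bin x; the running sum divided by _N equals the bin mean
* in the last observation of each block (no missing values remain)
generate xbin = round(weight, 500)
bysort xbin: generate double ymean = sum(mpg) / _N

* Keep only the last observation of every block
by xbin: keep if _n == _N

list xbin ymean, noobs
restore
```

Dropping early means every later -by:- pass touches fewer observations and variables, which is the point of the advice.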
>>>
>>>
>>> 2012/4/3 László Sándor <[email protected]>:
>>> > Thanks for this, Nick.
>>> >
>>> > I found the (plentiful and embarrassing) mistakes in my code; below is a
>>> > neater version that also actually does what it should, or so it seems.
>>> >
>>> > That said, it is still rarely faster than logging -tab, sum()- though
>>> > with many millions of observations, running on many (>4) cores, it at
>>> > least has a little advantage. (But both beat my bare bones Mata
>>> > attempts.)
>>> >
>>> > I would still be a bit curious how secret the secret sauce of
>>> > StataCorp is for this, as this "collapsing" is pretty commonplace for
>>> > many descriptives (also bar graphs, line graphs etc.), and while they
>>> > are rightly proud of having tweaked -tabulate- to run this fast,
>>> > they perhaps could let us (and themselves) work towards making other
>>> > similar code run faster too. Though, of course, there must be a
>>> > reason (general purpose etc.) why this is harder elsewhere.
>>> >
>>> > Thanks again,
>>> >
>>> > Laszlo
>>> >
>>> > tempvar wsum tag
>>> >
>>> > if ("`y2_var'"!="") local y2 y2
>>> > else local y2 ""
>>> >
>>> > sort `x_q' `touse'
>>> > by `x_q' `touse': g byte `tag' = _n == _N
>>> > if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
>>> > else by `x_q' `touse': g `wsum' = _N
>>> >
>>> > foreach v in x y `y2' {
>>> > if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
>>> > else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
>>> >
>>> > quietly replace ``v'_mean' = cond(`tag' & `touse',``v'_mean'/`wsum',.)
>>> > }
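A self-contained version of the weighted case may make the fragment above easier to follow. It is a sketch under stated assumptions: the auto dataset, the bin width of 500, and displacement standing in for the weight variable are all illustrative, and xq, w, wsum, ymean and tag are hypothetical names rather than the macros used in the thread.

```stata
* Sketch only: data, bin width and weight variable are illustrative
sysuse auto, clear
generate xq = round(weight, 500)
generate double w = displacement

* Running sums restart within each bin under -by:-
bysort xq: generate double wsum = sum(w)
by xq: generate double ymean = sum(mpg * w) / wsum

* Only the last observation of each bin holds the full weighted mean
by xq: generate byte tag = _n == _N
list xq ymean if tag, noobs

* Cross-check against the built-in table of weighted means
tabulate xq [aweight = w], summarize(mpg) means
```

The key difference from the earlier attempt quoted further down is that the weight total is accumulated under -by:-, so it restarts in every bin instead of running over the whole dataset.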
>>> >
>>> > On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <[email protected]> wrote:
>>> >>
>>> >> I will look at it tomorrow.
>>> >>
>>> >> 2012/4/2 László Sándor <[email protected]>:
>>> >> > Nick,
>>> >> >
>>> >> > thanks, I did follow up on your post. Sadly, I could not easily get
>>> >> > -by- working, or to be precise, make use of the variables that it
>>> >> > generated. Below is an attempt. If I may take the liberty of asking
>>> >> > you to parse it, I would be grateful for comments to get it
>>> >> > working -- the indexing must be off. It tries to average two (x_r and
>>> >> > y_r) or three (y2_r extra) variables. It generates values that are too
>>> >> > large for some bins (e.g., from U[0,1] variables some averages come
>>> >> > out larger than 20).
>>> >> >
>>> >> > I am happy if someone from StataCorp follows up too! :)
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > László
>>> >> >
>>> >> > tempvar wsum tag ones
>>> >> > g byte `ones' = 1
>>> >> >
>>> >> >
>>> >> > if ("`y2_var'"!="") local y2 y2
>>> >> > else local y2 ""
>>> >> >
>>> >> >
>>> >> > if ("`weight1'"!="") g `wsum' = sum(`weight1') if `touse'
>>> >> > else g `wsum' = sum(`ones') if `touse'
>>> >> >
>>> >> >
>>> >> > sort `x_q'
>>> >> > by `x_q': g byte `tag' = _N if `touse'
>>> >> >
>>> >> > foreach v in x y `y2' {
>>> >> > if "`weight1'"!=""{
>>> >> > by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1') if `touse'
>>> >> > by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>> >> > }
>>> >> >
>>> >> > else {
>>> >> > by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
>>> >> > by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>> >> > }
>>> >> > }
>>> >> >
>>> >> >
>>> >> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <[email protected]> wrote:
>>> >> >>
>>> >> >> We are back to the questions you asked a week ago. Mostly this is for
>>> >> >> StataCorp. Otherwise please see again my answers at
>>> >> >>
>>> >> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
>>> >> >>
>>> >> >> I've had dramatic speed-ups with Mata -- my record is reducing
>>> >> >> execution time from 5 days to 2 minutes, but that was partly because
>>> >> >> my original code was so dumb -- but I've not tried anything like the
>>> >> >> stuff you were using.
>>> >> >>
>>> >> >> -tabulate, summarize- is compiled C code. I think the nearest you can
>>> >> >> get is by using -by:- as explained in the post just quoted.
>>> >> >>
>>> >> >> Nick
>>> >> >>
>>> >> >> 2012/4/2 László Sándor <[email protected]>:
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > I had several questions recently on this list about compiling Mata
>>> >> >> > code. I still could not work out how to generate the compile-time
>>> >> >> > locals with loops, so I typed them out and compiled. I have now done
>>> >> >> > my test runs, and the results are surprising. Let me ask you why:
>>> >> >> >
>>> >> >> > My basic problem was to do a fast "collapse" to make binned scatter
>>> >> >> > plots. Collapse was unacceptably slow, probably because of the
>>> >> >> > necessary preserve-restore cycles, or inefficient coding of collapse
>>> >> >> > (for its general purpose).
>>> >> >> >
>>> >> >> > I already had a version that parsed a log of -tabulate, summarize-.
>>> >> >> > Yes, it is as much of a hack as it sounds. I was not expecting
>>> >> >> > this to be fast, not least because of the file I/O and the parsing.
>>> >> >> >
>>> >> >> > Now I built a Mata function that "collapses" into new variables while
>>> >> >> > leaving the data otherwise intact. For this I used Ben Jann's
>>> >> >> > -mf_mm_collapse-, and compiled all the necessary functions myself in
>>> >> >> > the ado file.
>>> >> >> >
>>> >> >> > And the test run with 100 million observations told me it was slower
>>> >> >> > than the hack. Before I give up and claim the hack unbeatable, I have
>>> >> >> > one suspicion. I had the test run on Stata 12 MP on a cluster, with
>>> >> >> > 12 cores. Perhaps -tabulate- used all of them, and my code did not.
>>> >> >> >
>>> >> >> > Are there guidelines on how to speed up Mata in this situation (if it
>>> >> >> > is not MP-aware to begin with)?
>>> >> >> >
>>> >> >> > Or, tentatively, can I ask for some guidance about the magic of
>>> >> >> > -tabulate, summarize-? Is that magic accessible/reproducible without
>>> >> >> > just logging its output?
>>> >> >> >
>>>
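For anyone wanting to repeat the comparison discussed in this thread, a rough timing harness might look like the following. It is a sketch only: the simulated data, the one million observations, the seed, and the bin width of 0.05 are assumptions, far smaller than the 100 million observations mentioned above, and timings will vary with the Stata flavour and the number of cores.

```stata
* Sketch only: simulated data and sizes are illustrative assumptions
clear
set obs 1000000
set seed 12345
generate x = runiform()
generate y = runiform()
generate xq = round(x, 0.05)

timer clear

* Timer 1: the built-in table of bin means
timer on 1
quietly tabulate xq, summarize(y)
timer off 1

* Timer 2: -by:- with a running sum, including the preserve/restore cost
timer on 2
preserve
bysort xq: generate double ymean = sum(y) / _N
by xq: keep if _n == _N
restore
timer off 2

timer list

* Number of processors this Stata session is set to use
display c(processors)
```

Timing the -preserve- and -restore- inside timer 2 is deliberate, since that overhead was one of the suspected reasons -collapse- felt slow.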