Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
Date: Tue, 3 Apr 2012 17:05:11 +0100

I don't know what that - 0.5 * width term is doing there. Some ancient
illogic, I guess.

On Tue, Apr 3, 2012 at 4:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> I had a hack at my -binsm- (STB, SJ) to get a slightly more modern
> flavour. Code follows examples.
>
> At present you can have any bins you like so long as they are defined
> by round(xvar, #) in the first instance. But if # is 0, the distinct
> values of the x variable are used, so you can use a previously-defined
> binning variable.
>
> . sysuse auto, clear
> (1978 Automobile Data)
>
> . binmean mpg weight
>
> . binmean mpg weight, width(100)
>
> . binmean mpg weight, width(100) by(foreign, compact)
>
> . binmean mpg weight, width(100) recast(bar) barw(100) base(0)
>
> . binmean mpg weight, width(100) recast(connected)
>
> . binmean turn trunk weight, width(100) recast(connected)
>
> *! 1.0.0 NJC 3 April 2012
> program binmean
>     version 8.2
>     syntax varlist(min=2 numeric) [if] [in] ///
>     [ , Width(numlist min=1) BY(str) PLOT(str asis) ///
>     ADDPLOT(str asis) * ]
>
>     quietly {
>         if "`width'" == "" local width 0
>         local text "bin width `width'"
>
>         if `"`by'"' != "" {
>             gettoken byvar byrest : by, parse(,)
>             gettoken comma byrest : byrest, parse(,)
>             local byby `"by(`byvar', note(`text') `byrest')"'
>         }
>         else local byby "note(`text')"
>
>         marksample touse
>         if "`byvar'" != "" markout `touse' `byvar', strok
>         count if `touse'
>         if r(N) == 0 error 2000
>
>         preserve
>         keep if `touse'
>         keep `varlist' `byvar'
>         local nv : word count `varlist'
>         local x : word `nv' of `varlist'
>         local Y : list varlist - x
>
>         tempvar xbin work
>         clonevar `xbin' = `x'
>         replace `xbin' = round(`x', `width') - 0.5 * `width'
>
>         foreach y of local Y {
>             tempvar ymean
>             clonevar `ymean' = `y'
>             bysort `xbin' `byvar' : replace `ymean' = sum(`y') / _N
>             local yshow `yshow' `ymean'
>         }
>
>         bysort `xbin' `byvar': keep if _n == _N
>     }
>
>     scatter `yshow' `xbin', ///
>     `byby' `options' || ///
>     || `plot' ///
>     || `addplot'
> end
>
>
> 2012/4/3 László Sándor <sandorl@gmail.com>:
>> Thanks, Nick, this is very helpful.
>>
>> -binsm- does something different, but I'll have a look and see what I
>> could adapt from its source.
>>
>> -twoway__histogram_gen- is about frequencies still, but something like
>> this is a great idea. Actually, if I could find a routine like this
>> for bar or line graphs, it probably does what I need (and then I would
>> be really surprised if that would still be slower than -tab, sum()-).
>>
>> Sadly, there is no twoway__line_gen or twoway__bar_gen, and other
>> searches did not help.
>>
>> But this was very educational, thanks again!
>>
>> Laszlo
>>
>> On Tue, Apr 3, 2012 at 5:01 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>
>>> Overnight I remembered -binsm-
>>>
>>> SJ-6-1  gr26_1  . . . . . . . . . . . . . . . . . . Software update for binsm
>>>         (help binsm if installed) . . . . . . . . . . . . . . . . .  N. J. Cox
>>>         Q1/06   SJ 6(1):151
>>>         rewritten to support modern Stata graphics
>>>
>>> STB-37  gr26  . . . . . . . . . . . Bin smoothing and summary on scatter plots
>>>         (help binsm if installed) . . . . . . . . . . . . . . . . .  N. J. Cox
>>>         5/97    pp.9--12; STB Reprints Vol 7, pp.59--63
>>>         alternative to graph, twoway bands(); produces a scatterplot
>>>         of yvar against xvar with one or more summaries of yvar for bins
>>>         of xvar
>>>
>>> and -twoway__histogram_gen-
>>>
>>> SJ-5-2  gr0014  . . . . . . . Stata tip 20: Generating histogram bin variables
>>>         . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. A. Harrison
>>>         Q2/05   SJ 5(2):280--281                                 (no commands)
>>>         tip illustrating the use of twoway__histogram_gen for
>>>         creation of complex histograms and other graphs or tables
>>>
>>> My strategic advice is this. You want a reduced dataset for graphing,
>>> so -drop- aggressively.
>>> Once you have identified observations "to use", go
>>>
>>> keep if `touse'
>>> drop `touse'
>>>
>>> Once the mean is in the last observation of every block of
>>> observations, -drop- all the others.
>>>
>>>
>>> 2012/4/3 László Sándor <sandorl@gmail.com>:
>>> > Thanks for this, Nick.
>>> >
>>> > I found my (plentiful and embarrassing) mistakes in my code; below is a
>>> > neater version that also actually does what it should, or so it seems.
>>> >
>>> > That said, it is still rarely faster than logging -tab, sum()-, though
>>> > with many millions of observations, running on many (>4) cores, it at
>>> > least has a little advantage. (But both beat my bare bones Mata
>>> > attempts.)
>>> >
>>> > I would still be a bit curious how secret the secret sauce of
>>> > StataCorp is for this, as this "collapsing" is pretty commonplace for
>>> > many descriptives (also bar graphs, line graphs etc), and while they
>>> > are rightly proud if they could tweak -tabulate- to run this fast,
>>> > they perhaps could let us (and themselves) work towards other
>>> > similar code also running faster. Though, of course, there must be a
>>> > reason (general purpose etc.) why this is harder elsewhere.
>>> >
>>> > Thanks again,
>>> >
>>> > Laszlo
>>> >
>>> > tempvar wsum tag
>>> >
>>> > if ("`y2_var'"!="") local y2 y2
>>> > else local y2 ""
>>> >
>>> > sort `x_q' `touse'
>>> > by `x_q' `touse': g byte `tag' = _n == _N
>>> > if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
>>> > else by `x_q' `touse': g `wsum' = _N
>>> >
>>> > foreach v in x y `y2' {
>>> >     if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
>>> >     else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
>>> >
>>> >     quietly replace ``v'_mean' = cond(`tag' & `touse',``v'_mean'/`wsum',.)
>>> > }
>>> >
>>> > On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> >>
>>> >> I will look at it tomorrow.
>>> >>
>>> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>>> >> > Nick,
>>> >> >
>>> >> > thanks, I did follow up with your post. Sadly, I could not easily get
>>> >> > -by- working, or to be precise, to use the variables that it
>>> >> > generated. Below I have an attempt, if I can take liberty with your
>>> >> > time and expect you to parse it, I am grateful for comments to get it
>>> >> > working -- the indexing must be off. It tries to average two (x_r and
>>> >> > y_r) or three (y2_r extra) variables. It generates too large values
>>> >> > for some bins (i.e. from U[0,1] variables some averages become larger
>>> >> > than 20.)
>>> >> >
>>> >> > I am happy if someone from StataCorp follows up too! :)
>>> >> >
>>> >> > Thanks,
>>> >> >
>>> >> > László
>>> >> >
>>> >> > tempvar wsum tag ones
>>> >> > g byte `ones' = 1
>>> >> >
>>> >> >
>>> >> > if ("`y2_var'"!="") local y2 y2
>>> >> > else local y2 ""
>>> >> >
>>> >> >
>>> >> > if ("`weight1'"!="") g `wsum' = sum(`weight1') if `touse'
>>> >> > else g `wsum' = sum(`ones') if `touse'
>>> >> >
>>> >> >
>>> >> > sort `x_q'
>>> >> > by `x_q': g byte `tag' = _N if `touse'
>>> >> >
>>> >> > foreach v in x y `y2' {
>>> >> >     if "`weight1'"!=""{
>>> >> >         by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1') if `touse'
>>> >> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>> >> >     }
>>> >> >
>>> >> >     else {
>>> >> >         by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
>>> >> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>>> >> >     }
>>> >> > }
>>> >> >
>>> >> >
>>> >> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> >> >>
>>> >> >> We are back to the questions you asked a week ago. Mostly this is for
>>> >> >> StataCorp.
>>> >> >> Otherwise please see again my answers at
>>> >> >>
>>> >> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
>>> >> >>
>>> >> >> I've had dramatic speed-ups with Mata -- my record is reducing
>>> >> >> execution time from 5 days to 2 minutes, but that was partly because
>>> >> >> my original code was so dumb -- but I've not tried anything like the
>>> >> >> stuff you were using.
>>> >> >>
>>> >> >> -tabulate, summarize- is compiled C code. I think the nearest you can
>>> >> >> get is by using -by:- as explained in the post just quoted.
>>> >> >>
>>> >> >> Nick
>>> >> >>
>>> >> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > I had several questions recently on this list about compiling Mata
>>> >> >> > code. I still could not deal with generating the compile time locals
>>> >> >> > with loops, but I typed them out and compiled. Now I had my test runs
>>> >> >> > but they are surprising. Let me ask you why:
>>> >> >> >
>>> >> >> > My basic problem was to do a fast "collapse" to make binned scatter
>>> >> >> > plots. Collapse was unacceptably slow, probably because of the
>>> >> >> > necessary preserve-restore cycles, or inefficient coding of collapse
>>> >> >> > (for its general purpose).
>>> >> >> >
>>> >> >> > I already had a version that parsed a log of -tabulate, summarize-.
>>> >> >> > Yes, it is as much of a hack as it sounds like. I was not expecting
>>> >> >> > this to be fast, at least because of the file I/O and the parsing.
>>> >> >> >
>>> >> >> > Now I built a Mata function that "collapses" into new variables with
>>> >> >> > leaving the data intact otherwise. For this I used Ben Jann's
>>> >> >> > -mf_mm_collapse-, and compiled all the necessary functions myself in
>>> >> >> > the ado file.
>>> >> >> >
>>> >> >> > And the test run with 100 million observations told me it was slower
>>> >> >> > than the hack. Before I give up and claim the hack unbeatable, I have
>>> >> >> > one suspicion.
>>> >> >> > I had the test run on Stata 12 MP on a cluster, with 12
>>> >> >> > cores. Perhaps -tabulate- used all of them, and my code did not.
>>> >> >> >
>>> >> >> > Are there guidelines on how to speed up Mata in this situation (if it
>>> >> >> > is not MP-aware to begin with)?
>>> >> >> >
>>> >> >> > Or, tentatively, can I ask for some guidance about the magic of
>>> >> >> > -tabulate, summarize-? Is that magic accessible/reproducible without
>>> >> >> > just logging its output?
>>> >> >> >
>>>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
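
The -by:- strategy discussed in this thread (running sums within bins, with the result read off the last observation of each bin, then dropping everything else) can be sketched roughly as follows. This is only an illustration, not code from any of the posts: the auto data, the bin width of 500, the plain variable names, and the use of rep78 as a stand-in weight are all assumptions made for the example.

```stata
* Sketch only: unweighted and weighted bin means via -by:- on the auto data.
* Bin width 500 and rep78 as a "weight" are illustrative assumptions.
sysuse auto, clear
keep if !missing(mpg, weight, rep78)

* Bin the x variable, in the spirit of round(xvar, #)
generate xbin = round(weight, 500)

* Running sums restart within each bin under -by:-, so the last
* observation of each bin holds that bin's totals
bysort xbin: generate double wsum  = sum(rep78)
by xbin:     generate double wysum = sum(rep78 * mpg)
by xbin:     generate double ymean = sum(mpg) / _N

* Keep one observation per bin, then finish the weighted mean
by xbin: keep if _n == _N
generate double ywmean = wysum / wsum
keep xbin ymean ywmean

list, noobs
```

Running -tabulate xbin, summarize(mpg)- just before the -keep if _n == _N- step should report the same unweighted bin means, which gives a quick check on the by-group indexing.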