From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: how to parallelize Mata (or steal the performance of built-in -tab, summarize-)
Date: Tue, 3 Apr 2012 10:01:48 +0100
Overnight I remembered -binsm-

SJ-6-1  gr26_1  . . . . . . . . . . . . . . .  Software update for binsm
        (help binsm if installed) . . . . . . . . . . . . . .  N. J. Cox
        Q1/06   SJ 6(1):151
        rewritten to support modern Stata graphics

STB-37  gr26  . . . . .  Bin smoothing and summary on scatter plots
        (help binsm if installed) . . . . . . . . . . . . . .  N. J. Cox
        5/97    pp.9--12; STB Reprints Vol 7, pp.59--63
        alternative to graph, twoway bands(); produces a scatterplot
        of yvar against xvar with one or more summaries of yvar for
        bins of xvar

and -twoway__histogram_gen-

SJ-5-2  gr0014  . . .  Stata tip 20: Generating histogram bin variables
        . . . . . . . . . . . . . . . . . . . . . .  D. A. Harrison
        Q2/05   SJ 5(2):280--281                      (no commands)
        tip illustrating the use of twoway__histogram_gen for
        creation of complex histograms and other graphs or tables

My strategic advice is this. You want a reduced dataset for graphing,
so -drop- aggressively. Once you have identified observations "to
use", go

        keep if `touse'
        drop `touse'

Once the mean is in the last observation of every block of
observations, -drop- all the others.
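Untested, but in terms of László's code below that reduction might look
something like this, assuming the data are still sorted by `x_q' (and
`touse') and the bin means already sit in the last relevant observation
of each `x_q' block:

        keep if `touse'
        drop `touse'
        * the bin means now sit in the last observation of each
        * `x_q' block; keep just those
        tempvar last
        by `x_q': generate byte `last' = (_n == _N)
        keep if `last'
        drop `last'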
2012/4/3 László Sándor <sandorl@gmail.com>:
> Thanks for this, Nick.
>
> I found my (plenty and embarrassing) mistakes in my code; below is a
> neater version that also actually does what it should, or so it seems.
>
> That said, it is still rarely faster than logging -tab, sum()-, though
> with many millions of observations, running on many (>4) cores, it at
> least has a small advantage. (But both beat my bare-bones Mata
> attempts.)
>
> I would still be a bit curious how secret StataCorp's secret sauce is
> here, as this "collapsing" is pretty commonplace for many descriptives
> (also bar graphs, line graphs, etc.), and while they are rightly proud
> that they could tweak -tabulate- to run this fast, they could perhaps
> help us (and themselves) get other, similar code running faster too.
> Though, of course, there must be a reason (general purpose, etc.) why
> this is harder elsewhere.
>
> Thanks again,
>
> Laszlo
>
> tempvar wsum tag
>
> if ("`y2_var'"!="") local y2 y2
> else local y2 ""
>
> sort `x_q' `touse'
> by `x_q' `touse': g byte `tag' = _n == _N
> if ("`weight1'"!="") by `x_q' `touse': g `wsum' = sum(`weight1')
> else by `x_q' `touse': g `wsum' = _N
>
> foreach v in x y `y2' {
>     if ("`weight1'"!="") by `x_q' `touse': g ``v'_mean' = sum(``v'_r'*`weight1')
>     else by `x_q' `touse': g ``v'_mean' = sum(``v'_r')
>
>     quietly replace ``v'_mean' = cond(`tag' & `touse', ``v'_mean'/`wsum', .)
> }
>
> On Mon, Apr 2, 2012 at 6:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>
>> I will look at it tomorrow.
>>
>> 2012/4/2 László Sándor <sandorl@gmail.com>:
>> > Nick,
>> >
>> > thanks, I did follow up with your post. Sadly, I could not easily
>> > get -by- working, or to be precise, could not use the variables
>> > that it generated. Below is an attempt; if I may take the liberty
>> > of asking you to parse it, I would be grateful for comments on
>> > getting it working -- the indexing must be off. It tries to average
>> > two (x_r and y_r) or three (y2_r extra) variables. It generates
>> > values that are too large for some bins (i.e. from U[0,1] variables
>> > some averages become larger than 20).
>> >
>> > I am happy if someone from StataCorp follows up too! :)
>> >
>> > Thanks,
>> >
>> > László
>> >
>> > tempvar wsum tag ones
>> > g byte `ones' = 1
>> >
>> > if ("`y2_var'"!="") local y2 y2
>> > else local y2 ""
>> >
>> > if ("`weight1'"!="") g `wsum' = sum(`weight1') if `touse'
>> > else g `wsum' = sum(`ones') if `touse'
>> >
>> > sort `x_q'
>> > by `x_q': g byte `tag' = _N if `touse'
>> >
>> > foreach v in x y `y2' {
>> >     if "`weight1'"!="" {
>> >         by `x_q': g ``v'_mean' = sum(``v'_r'*`weight1') if `touse'
>> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>> >     }
>> >     else {
>> >         by `x_q': g ``v'_mean' = sum(``v'_r') if `touse'
>> >         by `x_q': replace ``v'_mean' = ``v'_mean'/`wsum' if `tag' & `touse'
>> >     }
>> > }
>> >
>> > On Mon, Apr 2, 2012 at 3:36 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> >>
>> >> We are back to the questions you asked a week ago. Mostly this is
>> >> for StataCorp. Otherwise please see again my answers at
>> >>
>> >> http://www.stata.com/statalist/archive/2012-03/msg01144.html
>> >>
>> >> I've had dramatic speed-ups with Mata -- my record is reducing
>> >> execution time from 5 days to 2 minutes, but that was partly
>> >> because my original code was so dumb -- but I've not tried anything
>> >> like the stuff you were using.
>> >>
>> >> -tabulate, summarize- is compiled C code. I think the nearest you
>> >> can get is by using -by:- as explained in the post just quoted.
>> >>
>> >> Nick
>> >>
>> >> 2012/4/2 László Sándor <sandorl@gmail.com>:
>> >> > Hi all,
>> >> >
>> >> > I had several questions recently on this list about compiling
>> >> > Mata code. I still could not deal with generating the
>> >> > compile-time locals with loops, but I typed them out and
>> >> > compiled. Now I have had my test runs, but they are surprising.
>> >> > Let me ask you why:
>> >> >
>> >> > My basic problem was to do a fast "collapse" to make binned
>> >> > scatter plots. -collapse- was unacceptably slow, probably because
>> >> > of the necessary preserve-restore cycles, or the inefficient
>> >> > coding of -collapse- (for its general purpose).
>> >> >
>> >> > I already had a version that parsed a log of -tabulate,
>> >> > summarize-. Yes, it is as much of a hack as it sounds. I was not
>> >> > expecting this to be fast, at least because of the file I/O and
>> >> > the parsing.
>> >> >
>> >> > Now I built a Mata function that "collapses" into new variables
>> >> > while leaving the data intact otherwise. For this I used Ben
>> >> > Jann's -mf_mm_collapse-, and compiled all the necessary functions
>> >> > myself in the ado-file.
>> >> >
>> >> > And the test run with 100 million observations told me it was
>> >> > slower than the hack. Before I give up and claim the hack
>> >> > unbeatable, I have one suspicion. I ran the test on Stata 12 MP
>> >> > on a cluster with 12 cores. Perhaps -tabulate- used all of them,
>> >> > and my code did not.
>> >> >
>> >> > Are there guidelines on how to speed up Mata in this situation
>> >> > (if it is not MP-aware to begin with)?
>> >> >
>> >> > Or, tentatively, can I ask for some guidance about the magic of
>> >> > -tabulate, summarize-? Is that magic accessible/reproducible
>> >> > without just logging its output?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
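For the Mata route raised at the bottom of the thread, here is a
minimal, untested sketch of a group-mean "collapse" that avoids
-preserve-/-restore-, using -panelsetup()- and -panelsubmatrix()- on
data already sorted by the bin variable; the names xq, y, and y_mean
are placeholders, not the variables used above:

        sort xq
        mata:
            // group means of y by xq, written to the last observation of each bin
            st_view(Y = ., ., "y")
            st_view(G = ., ., "xq")
            info = panelsetup(G, 1)                  // first/last row of each bin
            idx  = st_addvar("double", "y_mean")     // new variable to hold the means
            for (i = 1; i <= rows(info); i++) {
                m = mean(panelsubmatrix(Y, i, info)) // mean of y within bin i
                st_store(info[i, 2], idx, m)         // store it in the bin's last obs
            }
        end

As far as I know, the explicit loop runs on a single core even under
MP, which would be consistent with the compiled -tabulate, summarize-
keeping its edge at 100 million observations.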