Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: László Sándor <sandorl@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Mon, 19 Aug 2013 11:21:03 -0400
But of course, the fastest example above is cheating a bit, as it knows the
values of v1 and v2. A simple -bysort- to circumvent that would immediately
punish us heavily with sorting dozens of gigabytes.

But-but-but, my main use case uses the discrete values of a variable. Is
-levelsof- faster than -bys-? (Then why isn't it used more often?) Or, as in
most cases the discrete values come from a previous xtiling, I know the
values of this variable, or might even keep track of the quantiles in a local
somewhere.

Thanks for any thoughts on speeding up binned averaging.

On Mon, Aug 19, 2013 at 10:50 AM, László Sándor <sandorl@gmail.com> wrote:
> Credit where credit is due: I meant Eric and Phil's tests, of course,
> I apologize, with Roger's thoughts also much appreciated.
>
> I am still surprised that loops of interpreted code beat the built-in
> C. So maybe -tabulate- was not heavily optimized in the end.
>
> Thanks for everything!
>
> Laszlo
>
> On Mon, Aug 19, 2013 at 10:45 AM, László Sándor <sandorl@gmail.com> wrote:
>> Thanks, all.
>>
>> I am still confused how I could combine the speed of some of the
>> methods like -collapse- without losing my data, which usually takes
>> dozens of GBs.
>>
>> Otherwise I think we are only talking about -tabulate- versus -table-
>> but both need log-parsing, or some -bys: summarize- and collecting
>> locals, which I did not attempt.
>>
>> FWIW, I also ran Roger's tests. Actually, I am surprised by the speed
>> of the many lines of -summarize, meanonly-, esp. as it runs over the
>> dataset many times, just with ifs selecting different observations.
>>
>> On an 8-core StataMP 13 for Linux,
>> full -tabulate, sum- itself took ~140s
>> -tab, matcell- took <5s, but indeed generates frequencies only.
>> a second -tabulate, sum-, even with nof and nost, took the same
>> also with the caplog wrapper
>> a -collapse, fast- took 36s, but of course this loses the data
>> the -summarize- took 92s without the postfiles, 34.53s with, but I
>> still cannot scatteri the results in the same Stata instance…
>>
>> On a 64-core StataMP 13 (in a cluster, with nodes of 8 cores plus MPIing):
>> full -tabulate, sum- itself took ~195s
>> -tab, matcell-: 8s
>> again same speed without frequencies and standard deviations, or with
>> the wrapper, for -tab, sum-
>> -collapse- took 60s
>> the loops of -summarize- took 160s now without the postfiles, 47s with.
>>
>> Thanks!
>>
>> Laszlo
>>
>> On Mon, Aug 19, 2013 at 8:59 AM, Phil Clayton
>> <philclayton@internode.on.net> wrote:
>>> There's no need to speculate - Eric and I provided example code, it's
>>> easy to test it and see for yourself. On my system (Stata/IC 13 for Mac)
>>> -tab, sum()- is definitely not the fastest method.
>>>
>>> Stata can only handle one dataset in memory, but it can store plenty of
>>> scalars, macros and matrices. Since all you want to do is plot the
>>> results using -scatteri- there is no need to have the results in a
>>> dataset anyway... (although for ease of programming a single -preserve-
>>> to access the results is often not too big a hit)
>>>
>>> Phil
>>>
>>> On 19/08/2013, at 10:16 PM, László Sándor <sandorl@gmail.com> wrote:
>>>
>>>> Thanks for all this.
>>>>
>>>> Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
>>>> the fastest method by far.
>>>>
>>>> But indeed, having multiple datasets in memory is the bottleneck, so I
>>>> am not sure whether postfile or logout would solve much of the problem:
>>>> for the results from the new files, I'd need to lose the current
>>>> data (or preserve and restore it).
>>>>
>>>> Currently, I am working on reading in the tabulated values into macros
>>>> to plug them into -scatteri-, but it is a hack.
>>>>
>>>> Thanks again,
>>>>
>>>> Laszlo
>>>>
>>>> On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
>>>> <r.newson@imperial.ac.uk> wrote:
>>>>> The main problem with this solution is that you have to put in a lot more
>>>>> programming time, especially if you want to conserve the variable labels,
>>>>> value labels etc. of the by-variables. (That at least is my excuse for the
>>>>> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
>>>>> tend to use.)
>>>>>
>>>>> IMHO it is a major limitation of Stata that it cannot store any number of
>>>>> datasets (or dataframes) in the memory at a time. If it could, then we would
>>>>> not be forced to use -preserve- and -restore- so often and burn computer
>>>>> time in file I/O, just to conserve person-days.
>>>>>
>>>>> On the other hand, R (the main serious non-legacy competitor to Stata
>>>>> nowadays) has the even greater limitation that it doesn't have anything
>>>>> quite like Mata. Plus only a few of my colleagues seem to be confident using
>>>>> R!!!
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Roger
>>>>>
>>>>> Roger B Newson BSc MSc DPhil
>>>>> Lecturer in Medical Statistics
>>>>> Respiratory Epidemiology and Public Health Group
>>>>> National Heart and Lung Institute
>>>>> Imperial College London
>>>>> Royal Brompton Campus
>>>>> Room 33, Emmanuel Kaye Building
>>>>> 1B Manresa Road
>>>>> London SW3 6LR
>>>>> UNITED KINGDOM
>>>>> Tel: +44 (0)20 7352 8121 ext 3381
>>>>> Fax: +44 (0)20 7351 8322
>>>>> Email: r.newson@imperial.ac.uk
>>>>> Web page: http://www.imperial.ac.uk/nhli/r.newson/
>>>>> Departmental Web page:
>>>>> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>>>>>
>>>>> Opinions expressed are those of the author, not of the institution.
>>>>>
>>>>> On 19/08/2013 01:06, Phil Clayton wrote:
>>>>>>
>>>>>> If you can avoid the -preserve- and -restore- you save loads of time (at
>>>>>> least on my modest system...)
>>>>>>
>>>>>> *--ex5. using summarize and postfile**
>>>>>> tempname post
>>>>>> tempfile postfile
>>>>>> postfile `post' v1 v2 mean sd n using "`postfile'"
>>>>>> forval x = 4(-1)1 {
>>>>>>     forval y = 3(-1)1 {
>>>>>>         display "v1=`x', v2=`y'"
>>>>>>         qui sum v3 if v1==`x' & v2 == `y'
>>>>>>         post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
>>>>>>     } //end of y loop
>>>>>> } //end of x loop
>>>>>> postclose `post'
>>>>>> use "`postfile'", clear
>>>>>>
>>>>>> On 19/08/2013, at 8:31 AM, Eric A. Booth <eric.a.booth@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Laszlo: I agree that it would be nice if -tabulate, summarize()-
>>>>>>> stored values but it doesn't. There are several options available to
>>>>>>> store those values and then use them elsewhere. The issues seem to be
>>>>>>> (1) ease of parsing the values into a format that you can use for
>>>>>>> other analyses and (2) (and more important for you) the speed with
>>>>>>> which you can calculate, store, parse, and then use those values.
>>>>>>>
>>>>>>> Some alternatives include logging the -tabulate,
>>>>>>> summarize()- output and then parsing it, using -collapse- to get your
>>>>>>> values, or using the compiled -summarize- command to obtain the
>>>>>>> values of interest and store them for use elsewhere. I'm sure there
>>>>>>> are other options, but below is a comparison of these methods against
>>>>>>> the speed of the desired -tabulate, summarize()- solution on a
>>>>>>> large-ish fake dataset.
>>>>>>>
>>>>>>> This is not a clean comparison and the values I store for later use
>>>>>>> are not exactly the same in every example, but it gives you an idea of
>>>>>>> the speed differences of the steps that might be involved for each
>>>>>>> approach (that is, preserving the data, summarizing or collapsing or
>>>>>>> XX, storing and parsing the output, and restoring the data).
>>>>>>> The upshot is that, for this example on my computer, it seems that running
>>>>>>> -summarize- in a loop to grab the values you want and store them in a
>>>>>>> dataset was the quickest non -tabulate, summarize()- option I tried
>>>>>>> (example 4 below), but this would be slower on a lot of data points.
>>>>>>> Plus, Examples 3 & 4 below are both faster than running -tabulate,
>>>>>>> summarize()-.
>>>>>>>
>>>>>>> Using -tabulate, summarize()- to get values takes about 101 seconds
>>>>>>> to run in my example.
>>>>>>> Example 1 is a regular tabulate example with cells stored in a matrix --
>>>>>>> this took about 9 seconds, but doesn't require any calculation of means
>>>>>>> or what not. Ex 2 uses -logout- to parse the output (you could do
>>>>>>> this manually too) and took the longest at about 109 seconds. Ex 3
>>>>>>> uses -collapse- with preserve/restore and takes about 36 seconds. Ex
>>>>>>> 4 uses a loop to grab means from summarize for certain values and
>>>>>>> takes about 27 seconds.
>>>>>>>
>>>>>>> *********************! Begin Example
>>>>>>> //intro stuff//
>>>>>>> clear all
>>>>>>> timer clear
>>>>>>> set rmsg on
>>>>>>> *--install packages for the example
>>>>>>> cap which logout
>>>>>>> if _rc ssc install logout , replace
>>>>>>> *--make fake data
>>>>>>> sa master.dta, replace emptyok //for later
>>>>>>> set obs `=2^25' //run on a big dataset
>>>>>>> forval x = 1/10 {
>>>>>>>     g v`x' = round(runiform()*5)
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> //examples//
>>>>>>> **
>>>>>>> tabulate v1 v2, summarize(v3) //for ref. takes c.108 Seconds
>>>>>>> **
>>>>>>>
>>>>>>> *--ex1. time working with -tab- stored values**
>>>>>>> **this doesn't get the values you need..
>>>>>>> **but allows us to compare speed of these approaches somewhat
>>>>>>> tab v1 v2, matcell(A)
>>>>>>> mat list A
>>>>>>> preserve
>>>>>>> clear
>>>>>>> svmat A, names(A)
>>>>>>> keep A1
>>>>>>> keep in 1/3 //parse
>>>>>>> l
>>>>>>> restore
>>>>>>>
>>>>>>>
>>>>>>> *--ex2. parsing the tab, summarize() output**
>>>>>>> *logout*
>>>>>>> preserve
>>>>>>> caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>>>>>> logout, use(mystuff.txt) save(mytable) clear dta replace
>>>>>>> u mytable.dta, clear
>>>>>>> keep v1 v2
>>>>>>> keep in 4/6 //parse as needed
>>>>>>> restore
>>>>>>> *! or just log this and parse it yourself, probably faster to do so
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *--ex3. using collapse**
>>>>>>> *this might be your best option if you have a lot of datapoints to calculate/store*!
>>>>>>> preserve
>>>>>>> collapse (mean) v3 , by(v1 v2)
>>>>>>> keep v2 v3
>>>>>>> keep in 2/5 //parse
>>>>>>> l
>>>>>>> restore
>>>>>>>
>>>>>>>
>>>>>>> *--ex4. using summarize**
>>>>>>> forval x = 4(-1)1 {
>>>>>>>     forval y = 3(-1)1 {
>>>>>>>         qui sum v3 if v1==`x' & v2 == `y', meanonly
>>>>>>>         loc val`x' `r(mean)'
>>>>>>>         preserve
>>>>>>>         clear
>>>>>>>         set obs 1
>>>>>>>         g name = "`x' and `y'"
>>>>>>>         g v1 = `val`x'' in 1
>>>>>>>         append using master.dta
>>>>>>>         sa master.dta, replace //values you need are in this dta file
>>>>>>>         restore
>>>>>>>     } //end of y loop
>>>>>>> } //end of x loop
>>>>>>> *********************! End Example
>>>>>>> note: -timer- kept resetting because the internal programming of -logout-
>>>>>>> was clearing the timer each time, so I just added up the -rmsg-
>>>>>>> timings.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> HTH,
>>>>>>>
>>>>>>> Eric
>>>>>>> ___
>>>>>>> Eric A. Booth
>>>>>>> Research Scientist
>>>>>>> Gibson Consulting Group
>>>>>>> ebooth@gibsonconsult.com
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <sandorl@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks again!
>>>>>>>>
>>>>>>>> I am not sure if those preserve-and-restore the data, but I should
>>>>>>>> check.
>>>>>>>>
>>>>>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>>>>>> read in numbers from the log file without opening a new dataset, and
>>>>>>>> plot "immediately" with -scatteri-.
>>>>>>>>
>>>>>>>> Laszlo
>>>>>>>>
>>>>>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>>>>>> <r.newson@imperial.ac.uk> wrote:
>>>>>>>>>
>>>>>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>>>>>> -xcollapse- packages, which you can download from SSC. These are extended
>>>>>>>>> versions of -collapse- and -contract-, which can save the output datasets
>>>>>>>>> (or resultssets) to Stata .dta files on disk, with which the user can do
>>>>>>>>> all kinds of plotting and tabulating.
>>>>>>>>>
>>>>>>>>> Best wishes
>>>>>>>>>
>>>>>>>>> Roger
>>>>>>>>>
>>>>>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks, Roger.
>>>>>>>>>>
>>>>>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>>>>>
>>>>>>>>>> I should have thought of -which-.
>>>>>>>>>>
>>>>>>>>>> I only wanted to capture some of the results/output without logging
>>>>>>>>>> and parsing the log.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Laszlo
>>>>>>>>>>
>>>>>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>>>>>> <r.newson@imperial.ac.uk> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think you'll find that everything really is in the executable
>>>>>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is because
>>>>>>>>>>> Stata is not open-source, and was never supposed to be. StataCorp have to
>>>>>>>>>>> make a living, and would probably not be able to do so if it was
>>>>>>>>>>> open-source and users could make generic copies.
>>>>>>>>>>>
>>>>>>>>>>> A lot of the code for a lot of official Stata is open-source (i.e. in
>>>>>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>>>>>
>>>>>>>>>>> which tabulate
>>>>>>>>>>>
>>>>>>>>>>> then Stata will answer
>>>>>>>>>>>
>>>>>>>>>>> built-in command: tabulate
>>>>>>>>>>>
>>>>>>>>>>> meaning that there is no file -tabulate.ado-.
>>>>>>>>>>>
>>>>>>>>>>> I hope this helps.
>>>>>>>>>>>
>>>>>>>>>>> Best wishes
>>>>>>>>>>>
>>>>>>>>>>> Roger
>>>>>>>>>>>
>>>>>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>>>>>> checked all folders, incl. hidden folders within /Applications/Stata
>>>>>>>>>>>> on my mac, and even checked the package contents of
>>>>>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>>>>>> other plugin/DLL whatsoever. Is everything wrapped into the Unix
>>>>>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>>>>>> Really?
>>>>>>>>>>>>
>>>>>>>>>>>> As I only need the results of -tab, sum()-, I hope to see some code
>>>>>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>>>>>> everything in the compiled binary instead?
>>>>>>>>>>>>
>>>>>>>>>>>> Well, something must add up those 33.9 MBs…
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for any thoughts,
>>>>>>>>>>>>
>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>
>>>>>>>>>>>> *
>>>>>>>>>>>> * For searches and help try:
>>>>>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>>> * http://www.ats.ucla.edu/stat/stata/
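
The question in the latest message (binned averages without sorting or replacing the data in memory) could be approached along the lines Phil suggested, by keeping the results in a macro and handing them to -scatteri-. What follows is only a minimal sketch under stated assumptions: the variables bin and y and the simulated data are hypothetical stand-ins, bin plays the role of a previous -xtile- result, and nothing here settles whether -levelsof- beats -bysort- on dozens of gigabytes.

```stata
* Hypothetical toy data; names and sizes are illustrative only
clear
set seed 12345
set obs 100000
generate byte bin = 1 + floor(runiform()*20)   // stands in for an -xtile- result
generate double y = bin + rnormal()

* -levelsof- puts the distinct values of bin into the local macro bins
quietly levelsof bin, local(bins)

* One -summarize, meanonly- per level; accumulate "y x" pairs in a local
local coords
foreach b of local bins {
    quietly summarize y if bin == `b', meanonly
    local coords `coords' `r(mean)' `b'
}

* -scatteri- takes the pairs as immediate arguments (y first, then x),
* so the dataset in memory is never replaced
twoway scatteri `coords', xtitle("bin") ytitle("mean of y")
```

If the bin values are already known (for example tracked in a local after the xtiling), the -levelsof- call can be skipped and the loop run over that local instead. Each -summarize ... if- still makes a full pass over the data, so the cost grows with the number of bins.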