Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
From: László Sándor <[email protected]>
To: [email protected]
Subject: Re: st: where is StataCorp C code located? all in a single executable as compiled binary?
Date: Mon, 19 Aug 2013 08:16:33 -0400
Thanks for all this.
Maybe I got Phil wrong, but I'd be surprised if -tab, sum()- is not
the fastest method by far.
But indeed, not being able to hold multiple datasets in memory is the
bottleneck, so I am not sure whether postfile or logout would solve much
of the problem: to use the results from the new files, I'd need to lose
the current data (or preserve and restore it).
Currently, I am working on reading the tabulated values into macros
to plug them into -scatteri-, but it is a hack.
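A minimal sketch of the -scatteri- idea, for illustration only. It is not the log-parsing hack described above but a simpler stand-in: it loops over -summarize, meanonly- and accumulates the group means in a local macro, so the data in memory are never replaced. The fake data, the variable names v1 and v3 (borrowed from Eric's example quoted below), the seed, and the 1/4 range are all assumptions.

```stata
* Assumed fake data so the sketch runs on its own (not the real dataset)
clear
set seed 12345
set obs 1000
generate v1 = 1 + floor(4*runiform())   // groups 1..4
generate v3 = rnormal()                 // outcome to average

* Accumulate "y x" pairs for -scatteri- in a local macro
local coords
forvalues x = 1/4 {
    quietly summarize v3 if v1 == `x', meanonly
    if r(N) > 0 {
        local coords `coords' `r(mean)' `x'
    }
}

* Plot the group means without creating or loading another dataset
twoway scatteri `coords', ytitle("Mean of v3") xtitle("v1")
```

Run it as a do-file in one go, since local macros do not survive between separately executed chunks. This is probably only practical for a modest number of points, because every pair has to fit on the -scatteri- command line.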
Thanks again,
Laszlo
On Mon, Aug 19, 2013 at 7:26 AM, Roger B. Newson
<[email protected]> wrote:
> The main problem with this solution is that you have to put in a lot more
> programming time, especially if you want to conserve the variable labels,
> value labels etc. of the by-variables. (That at least is my excuse for the
> CPU-intensive, near-SAS-like and 20th-century-looking method that I still
> tend to use.)
>
> IMHO it is a major limitation of Stata that it cannot store any number of
> datasets (or dataframes) in the memory at a time. If it could, then we would
> not be forced to use -preserve- and -restore- so often and burn computer
> time in file I/O, just to conserve person-days.
>
> On the other hand, R (the main serious non-legacy competitor to Stata
> nowadays) has the even greater limitation that it doesn't have anything
> quite like Mata. Plus only a few of my colleagues seem to be confident using
> R!!!
>
>
> Best wishes
>
> Roger
>
> Roger B Newson BSc MSc DPhil
> Lecturer in Medical Statistics
> Respiratory Epidemiology and Public Health Group
> National Heart and Lung Institute
> Imperial College London
> Royal Brompton Campus
> Room 33, Emmanuel Kaye Building
> 1B Manresa Road
> London SW3 6LR
> UNITED KINGDOM
> Tel: +44 (0)20 7352 8121 ext 3381
> Fax: +44 (0)20 7351 8322
> Email: [email protected]
> Web page: http://www.imperial.ac.uk/nhli/r.newson/
> Departmental Web page:
> http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/
>
> Opinions expressed are those of the author, not of the institution.
>
> On 19/08/2013 01:06, Phil Clayton wrote:
>>
>> If you can avoid the -preserve- and -restore- you save loads of time (at
>> least on my modest system...)
>>
>> *--ex5. using summarize and postfile**
>> tempname post
>> tempfile postfile
>> postfile `post' v1 v2 mean sd n using "`postfile'"
>> forval x = 4(-1)1 {
>> forval y = 3(-1)1 {
>> display "v1=`x', v2=`y'"
>> qui sum v3 if v1==`x' & v2 == `y'
>> post `post' (`x') (`y') (`r(mean)') (`r(sd)') (`r(N)')
>> } //end of y loop
>> } //end of x loop
>> postclose `post'
>> use "`postfile'", clear
>>
>> On 19/08/2013, at 8:31 AM, Eric A. Booth <[email protected]> wrote:
>>
>>> Hi Laszlo: I agree that it would be nice if -tabulate,summarize()-
>>> stored values but it doesn't. There are several options available to
>>> store those values and then use them elsewhere. The issues seem to be
>>> (1) ease of parsing the values into a format that you can use for
>>> other analyses and (2) (and more important for you) the speed with
>>> which you can calculate, store, parse, and then use those values.
>>>
>>> Some alternatives include logging the -tabulate,
>>> summarize()- output and then parsing it, using -collapse- to get your
>>> values, or using the compiled -summarize- command to obtain the
>>> values of interest and store them for use elsewhere. I'm sure there
>>> are other options, but below is a comparison of these methods against
>>> the speed of the desired -tabulate, summarize()- solution on a
>>> large-ish fake dataset.
>>>
>>> This is not a clean comparison and the values I store for later use
>>> are not exactly the same in every example, but it gives you an idea of
>>> the speed differences of the steps that might be involved for each
>>> approach (that is, preserving the data, summarizing or collapsing or
>>> XX, storing and parsing the output, and restoring the data). The
>>> upshot is that, for this example on my computer, it seems that running
>>> -summarize- in a loop to grab the values you want and store them in a
>>> dataset was the quickest alternative to -tabulate, summarize()- that I
>>> tried (Example 4 below), but this would be slower if you need a lot of
>>> data points. Plus, Examples 3 & 4 below are both faster than running
>>> -tabulate, summarize()-.
>>>
>>> Using -tabulate, summarize()- to get values takes about 101 seconds
>>> to run in my example.
>>> Example 1 is a regular -tabulate- with the cell frequencies stored in a matrix;
>>> this took about 9 seconds, but doesn't require any calculation of means
>>> and so on. Ex 2 uses -logout- to parse the output (you could do
>>> this manually too) and took the longest at about 109 seconds. Ex 3
>>> uses -collapse- with preserve/restore and takes about 36 seconds. Ex
>>> 4 uses a loop to grab means from summarize for certain values and
>>> takes about 27 seconds.
>>>
>>> *********************! Begin Example
>>> //intro stuff//
>>> clear all
>>> timer clear
>>> set rmsg on
>>> *--install packages for the example
>>> cap which logout
>>> if _rc ssc install logout , replace
>>> *--make fake data
>>> sa master.dta, replace emptyok //for later
>>> set obs `=2^25' //run on a big dataset
>>> forval x = 1/10 {
>>> g v`x' = round(runiform()*5)
>>> }
>>>
>>>
>>> //examples//
>>> **
>>> tabulate v1 v2, summarize(v3) //for ref. takes c.108 Seconds
>>> **
>>>
>>> *--ex1. time working with -tab- stored values**
>>> **this doesnt get the values you need..
>>> **but allows us to compare speed of these approaches somewhat
>>> tab v1 v2, matcell(A)
>>> mat list A
>>> preserve
>>> clear
>>> svmat A, names(A)
>>> keep A1
>>> keep in 1/3 //parse
>>> l
>>> restore
>>>
>>>
>>> *--ex2. parsing the tab, summarize() output**
>>> *logout*
>>> preserve
>>> caplog using mystuff.txt, replace: tabulate v1 v2, summarize(v3) nof nost
>>> logout, use(mystuff.txt) save(mytable) clear dta replace
>>> u mytable.dta, clear
>>> keep v1 v2
>>> keep in 4/6 //parse as needed
>>> restore
>>> *! or just log this and parse it yourself, probably faster to do so
>>>
>>>
>>>
>>> *--ex3. using collapse**
>>> *this might be your best option if you have a lot of datapoints to calculate/store*!
>>> preserve
>>> collapse (mean) v3 , by(v1 v2)
>>> keep v2 v3
>>> keep in 2/5 //parse
>>> l
>>> restore
>>>
>>>
>>> *--ex4. using summarize**
>>> forval x = 4(-1)1 {
>>> forval y = 3(-1)1 {
>>> qui sum v3 if v1==`x' & v2 == `y', meanonly
>>> loc val`x' `r(mean)'
>>> preserve
>>> clear
>>> set obs 1
>>> g name = "`x' and `y'"
>>> g v1 = `val`x'' in 1
>>> append using master.dta
>>> sa master.dta, replace //values you need are in this dta file
>>> restore
>>> } //end of y loop
>>> } //end of x loop
>>> *********************! End Example
>>> Note: -timer- kept resetting because the internal programming of -logout-
>>> cleared the timers each time, so I just added up the -rmsg-
>>> timings.
>>>
>>>
>>>
>>> HTH,
>>>
>>> Eric
>>> ___
>>> Eric A. Booth
>>> Research Scientist
>>> Gibson Consulting Group
>>> [email protected]
>>>
>>>
>>>
>>>
>>> On Sun, Aug 18, 2013 at 4:26 PM, László Sándor <[email protected]> wrote:
>>>>
>>>>
>>>> Thanks again!
>>>>
>>>> I am not sure if those preserve-and-restore the data, but I should
>>>> check.
>>>>
>>>> I think what will happen is that I log the -tab, sum()-, and somehow
>>>> read in numbers from the log file without opening a new dataset, and
>>>> plot "immediately" with -scatteri-.
>>>>
>>>> Laszlo
>>>>
>>>> On Sun, Aug 18, 2013 at 5:04 PM, Roger B. Newson
>>>> <[email protected]> wrote:
>>>>>
>>>>> One way of doing what you want is probably to use the -xcontract- and
>>>>> -xcollapse- packages, which you can download from SSC. These are extended
>>>>> versions of -collapse- and -contract-, which can save the output datasets
>>>>> (or resultssets) to Stata .dta files on disk, with which the user can do all
>>>>> kinds of plotting and tabulating.
>>>>>
>>>>>
>>>>> Best wishes
>>>>>
>>>>> Roger
>>>>>
>>>>>
>>>>> On 18/08/2013 21:49, László Sándor wrote:
>>>>>>
>>>>>>
>>>>>> Thanks, Roger.
>>>>>>
>>>>>> I never meant that StataCorp should give away their source. I was only
>>>>>> hoping to squeeze out some more interoperability. And so much of the
>>>>>> rest of the code is in smaller chunks. Not -tabulate-, I see.
>>>>>>
>>>>>> I should have thought of -which-.
>>>>>>
>>>>>> I only wanted to capture some of the results/output without logging
>>>>>> and parsing the log.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Laszlo
>>>>>>
>>>>>> On Sun, Aug 18, 2013 at 4:31 PM, Roger B. Newson
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I think you'll find that everything really is in the executable
>>>>>>> "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP". This is because
>>>>>>> Stata is not open-source, and was never supposed to be. StataCorp have to
>>>>>>> make a living, and would probably not be able to do so if it was open-source
>>>>>>> and users could make generic copies.
>>>>>>>
>>>>>>> A lot of the code for a lot of official Stata is open-source (ie in
>>>>>>> ado-files), but -tabulate- isn't. If you type, in Stata,
>>>>>>>
>>>>>>> which tabulate
>>>>>>>
>>>>>>> then Stata will answer
>>>>>>>
>>>>>>> built-in command: tabulate
>>>>>>>
>>>>>>> meaning that there is no file -tabulate.ado-.
>>>>>>>
>>>>>>> I hope this helps.
>>>>>>>
>>>>>>> Best wishes
>>>>>>>
>>>>>>> Roger
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 18/08/2013 21:21, László Sándor wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am trying to understand how -tabulate, summarize- works. I
>>>>>>>> understand that much of it is written in C code, but I would still
>>>>>>>> expect to find some black boxes of files that do the magic. I think I
>>>>>>>> checked all folders, incl. hidden folders within /Applications/Stata
>>>>>>>> on my mac, and even checked the package contents of
>>>>>>>> /Applications/Stata/StataMP. I found no trace of -tabulate-, or any
>>>>>>>> other plugin/DLL whatsoever. Is everything wrapped into the Unix
>>>>>>>> executable "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP"?
>>>>>>>> Really?
>>>>>>>>
>>>>>>>> As I only need the results of -tab, sum()-, I hope to see some code
>>>>>>>> calling -_tab.ado- or some other code to display the results. Is
>>>>>>>> everything in the compiled binary instead?
>>>>>>>>
>>>>>>>> Well, something must add up to those 33.9 MB…
>>>>>>>>
>>>>>>>> Thanks for any thoughts,
>>>>>>>>
>>>>>>>> Laszlo
>>>>>>>>
>>>>>>>> *
>>>>>>>> * For searches and help try:
>>>>>>>> * http://www.stata.com/help.cgi?search
>>>>>>>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>> * http://www.ats.ucla.edu/stat/stata/
>>>>>>>>