Re: st: -collapsetofile-
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: -collapsetofile-
Date: Fri, 28 Feb 2014 18:19:45 +0000

-save- is part of the executable:

. which save
built-in command:  save

and so its code is not accessible to users.
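
For comparison, here is a minimal sketch of how to inspect a command that is implemented as an ado-file rather than built in. It assumes that -collapse- ships as collapse.ado, as in standard installations; the path that -which- reports will vary by machine.

```stata
* Report where each command is implemented (built-in, or the path of an ado-file)
which save
which collapse

* Open the source of an ado-file found along the adopath in the Viewer
viewsource collapse.ado
```

Reading collapse.ado this way shows the Stata-level logic of -collapse-, but not the internals of any built-in command it calls.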
Nick
[email protected]
On 28 February 2014 18:06, Andrew Maurer <[email protected]> wrote:
> Hi Statalist,
>
> I've written a pair of programs, -collapsetofile- and -recover-, to allow users to "collapse" data to a file without destroying the dataset in memory as -collapse- does. I don't know if anyone else will have use for this, but it will save me a lot of computer time when dealing with large datasets. I would be very interested in any input or comments on how to improve coding efficiency or style (the code is still a bit rough).
>
> ado file (collapsetofile.ado): http://codepad.org/DcwtvDEb
> ado file (recover.ado) : http://codepad.org/csZhQvb0
> sthlp file (collapsetofile.sthlp): http://codepad.org/AsKC79uK
>
> The biggest improvement would come from being able to save directly to a .dta. I assume that this would require either:
> 1) looking at the format/header/footer of Stata .dta files in clear text and fwrite()'ing it from Mata, and/or
> 2) looking at the source for a command like -save- and just copying that (is the source for -save- available?)
>
> Before writing this I found myself waiting for hours when graphing summary statistics of large datasets with sequences of:
>
> use fulldata // this could be >10gb
> preserve
> collapse (sum) thisvar thatvar, by(byvar1 byvar2)
> ... some data manipulation
> twoway line...
> restore
>
> preserve
> collapse (sum) anothervar yetanothervar, by(byvar3)
> ... some data manipulation
> twoway line...
> restore
>
> ...
>
> preserve
> collapse (sum) more vars, by(byvar10)
> ... some data manipulation
> twoway line...
> restore
>
> For a 20 GB dataset with 10 graphs, that makes 10 preserve/restore cycles * 20 GB = 200 GB written to and read back from disk. -collapsetofile- writes just the collapsed data to be graphed to a file, with no other disk reads or writes:
>
> use fulldata
> collapsetofile (sum) thisvar thatvar using dataforgraph1, by(byvar1 byvar2)
> collapsetofile (sum) anothervar yetanothervar using dataforgraph2, by(byvar3)
> ...
> collapsetofile (sum) more vars, by(byvar10)
>
> recover dataforgraph1, clear
> ... some data manipulation
> twoway line...
> ...
> recover dataforgraph2, clear
> ... some data manipulation
> twoway line...
> ...
>
> Thanks to Nick Cox for mentioning the importance of saving characteristics/metadata with the dataset.
> Thanks to Sergiy Radyakin for making me realize that I could never write a Mata program that would compute stats "by" variables as fast as Stata's -_mean- in -collapse-, since Stata's built-in C code can take advantage of parallelization, while Mata code cannot.
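
The preserve/collapse/restore cost described in the quoted message can also be avoided with frames, which were added in Stata 16 and so postdate this thread. The sketch below is not how -collapsetofile- works. It only illustrates the same goal: collapse to a file while the full dataset stays in memory. The bundled auto dataset stands in for fulldata, and the frame name g1 is hypothetical.

```stata
* Sketch only: requires Stata 16 or later (frames)
* auto.dta stands in for the large dataset; g1 is a hypothetical frame name
sysuse auto, clear

* Copy just the variables needed for one graph into a new in-memory frame
frame put price weight foreign, into(g1)

* Collapse and save inside that frame; the default frame is left untouched
frame g1 {
    collapse (sum) price weight, by(foreign)
    save dataforgraph1, replace
}

* Free the memory held by the temporary frame
frame drop g1

* The full dataset is still in memory, with no preserve/restore needed
describe, short
```

The trade-off is memory rather than disk: -frame put- holds a second copy of the selected variables until the frame is dropped, so this suits cases where the variables needed for each graph are a small part of the full dataset.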