st: -collapsetofile-
From: Andrew Maurer <[email protected]>
To: Statalist Statalist <[email protected]>
Subject: st: -collapsetofile-
Date: Fri, 28 Feb 2014 18:06:20 +0000
Hi Statalist,
I've written a pair of programs, -collapsetofile- and -recover-, that let users "collapse" data to a file without destroying the dataset in memory, as -collapse- does. I don't know whether anyone else will have a use for this, but it will save me a lot of computer time when dealing with large datasets. I would be very interested in any input or comments on how to improve coding efficiency / style (the code is still a bit rough).
ado file (collapsetofile.ado): http://codepad.org/DcwtvDEb
ado file (recover.ado) : http://codepad.org/csZhQvb0
sthlp file (collapsetofile.sthlp): http://codepad.org/AsKC79uK
The biggest improvement would come from being able to save directly to a .dta. I assume that this would require either:
1) looking at the format/header/footer of stata dtas in clear text and fwrite()'ing it from mata, and/or
2) looking at the source for a command like save and just copying that (is the source for -save- available?)
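One way to get collapsed results into a .dta without hand-writing the file format is Stata's built-in -postfile- suite, which writes observations to a .dta on disk while leaving the data in memory untouched. The following is only a minimal sketch of that idea, not how -collapsetofile- works: the auto dataset, the variable names, and the output name dataforgraph1 are placeholders, and it makes one pass over the data per group and variable, so it may well be slower than -collapse- when there are many groups.

```stata
* Sketch only: group sums written straight to a .dta via -postfile-.
* Dataset, variables, and file name are illustrative assumptions.
sysuse auto, clear

tempname handle sum_p sum_w
postfile `handle' byte foreign double(sum_price sum_weight) ///
    using dataforgraph1, replace

* One -summarize- per by-group and variable; r(sum) holds the group total
levelsof foreign, local(groups)
foreach g of local groups {
    quietly summarize price if foreign == `g', meanonly
    scalar `sum_p' = r(sum)
    quietly summarize weight if foreign == `g', meanonly
    scalar `sum_w' = r(sum)
    post `handle' (`g') (`sum_p') (`sum_w')
}
postclose `handle'

* The full dataset is still in memory at this point
describe, short
```

Because -postfile- produces an ordinary .dta, the result can be read back later with a plain -use dataforgraph1, clear-.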
Before writing this I found myself waiting for hours when graphing summary statistics of large datasets with sequences of:
use fulldata // this could be >10gb
preserve
collapse (sum) thisvar thatvar, by(byvar1 byvar2)
... some data manipulation
twoway line...
restore
preserve
collapse (sum) anothervar yetanothervar, by(byvar3)
... some data manipulation
twoway line...
restore
...
preserve
collapse (sum) more vars, by(byvar10)
... some data manipulation
twoway line...
restore
For a 20gb dataset with 10 graphs, that makes 10 preserve/restore cycles * 20gb = 200gb written to disk and then read back. -collapsetofile- writes only the collapsed data to be graphed to a file, with no other disk reads/writes:
use fulldata
collapsetofile (sum) thisvar thatvar using dataforgraph1, by(byvar1 byvar2)
collapsetofile (sum) anothervar yetanothervar using dataforgraph2, by(byvar3)
...
collapsetofile (sum) more vars using dataforgraph10, by(byvar10)
recover dataforgraph1, clear
... some data manipulation
twoway line...
...
recover dataforgraph2, clear
... some data manipulation
twoway line...
...
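For readers on Stata 16 or later (released after this post was written), frames offer another way to avoid the preserve/restore round trip: copy only the needed variables into a second in-memory frame and collapse there. This is a sketch of a different technique, not of -collapsetofile- itself; the auto dataset, the frame name tocollapse, and the file name dataforgraph1 are placeholders.

```stata
* Sketch only (requires Stata 16+): collapse in a scratch frame so the
* full dataset in the default frame is never written to disk.
sysuse auto, clear

* Copy just the variables needed for this graph into a new frame
frame put price weight foreign, into(tocollapse)

* Collapse and save inside that frame; the default frame is unchanged
frame tocollapse {
    collapse (sum) price weight, by(foreign)
    save dataforgraph1, replace
}
frame drop tocollapse

* The original 74 observations are still in memory
count
```

The trade-off is memory rather than disk: the copied variables exist twice until the scratch frame is dropped.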
Thanks to Nick Cox for mentioning the importance of saving characteristics/metadata with the dataset.
Thanks to Sergiy Radyakin for making me realize that I could never write a Mata program that computes statistics "by" variables as fast as Stata's -_mean- in -collapse-, since Stata's built-in C code can take advantage of parallelization, while Mata code cannot.
Thanks,
Andrew Maurer