Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Store datafile at minimum possible file size
From
"Pavlos C. Symeou" <[email protected]>
To
"[email protected]" <[email protected]>
Subject
Re: st: Store datafile at minimum possible file size
Date
Fri, 16 Apr 2010 15:55:02 +0200
Well, I just had to try this, and the result surprised me. I
had an enormous 14.5G dataset consisting of 600 string variables and
more than 35,000 observations. Exporting the dataset to tab-separated
format produced a file of about 800M. Compressing that file to Zip
format produced a file of a bit less than 18M. That is an amazing
difference. However, the problem remains, at least in my case,
when the time for analysis comes: I will still have to convert the
compressed file back to the .dta format, and then I am back at 14.5G.
At least I can save all my files on a single memory stick :)
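
For anyone who wants to repeat this round trip from within Stata, here is a minimal sketch. It assumes Stata 13 or later for -export delimited- and -import delimited- (older releases would use -outsheet- and -insheet- instead). The file name mydata and the shipped auto dataset are stand-ins for illustration, not the dataset described above.

```stata
* Assumption: the shipped auto dataset stands in for the real data
sysuse auto, clear

* Write a tab-separated text file, then compress it into a Zip archive
export delimited using mydata.txt, delimiter(tab) replace
zipfile mydata.txt, saving(mydata.zip, replace)

* When analysis time comes: unpack the archive and read the text file back
unzipfile mydata.zip, replace
import delimited using mydata.txt, delimiters("\t") clear
describe, short
```

The text file read back in this way comes without the original storage types and labels, so running -compress- afterwards is probably worthwhile.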
Cheers,
Pavlos
On 16/04/2010 14:49, [email protected] wrote:
-zipfile- has already been mentioned.
Inside Stata you can use -encode- to change a string var to a numeric var with value labels.
If the data contain many repeated strings, this can shrink the file to a small fraction of its size.
With -decode- you can always go back to the strings.
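
A minimal sketch of this round trip, assuming a made-up string var called city (the var names and the toy data are for illustration only):

```stata
* Toy data with repeated strings (hypothetical example)
clear
input str10 city
"Athens"
"Berlin"
"Athens"
"Cairo"
"Berlin"
end

* Replace the string var by a numeric var that carries value labels
encode city, generate(city_id)
drop city

* -encode- creates a long; -compress- demotes it to the smallest type that fits
compress

* Go back to the strings whenever they are needed
decode city_id, generate(city)
list city_id city
```

Note that -encode- numbers the levels in alphabetical order, so here Athens becomes 1, Berlin 2 and Cairo 3.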
***
You can even output the encoded file to ASCII and restore the value labels in other software with a script or a dictionary file, if the small file size is worth the extra effort.
A few times I have used Stata to create such a dictionary or script (e.g. in SQL).
If all commands have the same structure (as is often the case with SQL -update- or -insert- scripts),
you can use Stata's data editor to "write" the script. Some hints on how to do this:
You must do this separately for every var you want to process in this way:
First -levelsof- hands the levels to a local. Run a -foreach- loop over this local.
Inside the loop, the extended macro function -label- stores the value labels created by -encode- in locals.
The local names should contain the level number (like "loc123") so you can refer to them later.
Now you can use -duplicates drop- to keep only the unique levels of this var.
Delete all other vars and write the fixed parts of the commands as constant string vars.
Loop over the levels to insert the matching local (the value label string) next to each numeric value.
Use -order- to put all parts of the commands into the right place.
Copy the contents of the data editor into a text editor and you have a script.
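
The hints above can be sketched roughly as follows. This is a simplified variant that builds each command in a single string var rather than in several vars arranged with -order-. The var city, the SQL table cities and its columns name and id are all made up for illustration.

```stata
* Toy data (hypothetical example)
clear
input str10 city
"Athens"
"Berlin"
"Athens"
"Cairo"
"Berlin"
end
encode city, generate(city_id)

* Keep one row per level and drop everything else
duplicates drop city_id, force
keep city_id

* Collect the levels, then fetch each value label into a local named loc#
levelsof city_id, local(levels)
generate str80 cmd = ""
foreach l of local levels {
    local loc`l' : label (city_id) `l'
    * Build one SQL command per level around the numeric value and its label
    replace cmd = "UPDATE cities SET name = '`loc`l''' WHERE id = `l';" if city_id == `l'
}

list cmd, noheader clean
```

The cmd column can then be copied from the data editor, or from the -list- output, into a text editor.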
Stefan
*
* For searches and help try:
*http://www.stata.com/help.cgi?search
*http://www.stata.com/support/statalist/faq
*http://www.ats.ucla.edu/stat/stata/