Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Store datafile at minimum possible file size


From   Michael Boehm <[email protected]>
To   [email protected]
Subject   Re: st: Store datafile at minimum possible file size
Date   Fri, 16 Apr 2010 15:39:32 +0100

Thanks again, both of these suggestions sound like I could make
profitable use of them :)

Michael

On Fri, Apr 16, 2010 at 2:55 PM, Pavlos C. Symeou <[email protected]> wrote:
> Well, I just had to try this, and the result surprised me. I had
> an enormous dataset (14.5G) consisting of 600 string variables and more than
> 35000 observations. Exporting the dataset to tab-separated format resulted
> in a file of about 800M. Compressing it to the Zip format resulted in a file
> a bit less than 18M. That is an amazing difference. However, the problem
> always remains, at least in my case, when the time for analysis comes. I
> will still have to convert the compressed file back to the .dta format, and
> then get back to the 14.5G. At least I can save all my files on a single
> memory stick:)
>
> Cheers,
>
> Pavlos
>
> On 16/04/2010 14:49, [email protected] wrote:
>>
>> -zipfile- has already been mentioned.
>>
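A minimal sketch of the -zipfile- round trip, assuming Stata's shipped auto dataset as a stand-in for real data; the file names auto_copy.dta and auto_copy.zip are hypothetical. Zipping the .dta directly skips the export and re-import step, though how much space it saves will depend on the data.

```stata
* Hypothetical file names; auto.dta ships with Stata
sysuse auto, clear
save auto_copy, replace

* Compress the saved .dta into a zip archive in the working directory
zipfile auto_copy.dta, saving(auto_copy.zip, replace)

* Remove the uncompressed copy, then restore it from the archive when needed
erase auto_copy.dta
unzipfile auto_copy.zip, replace
use auto_copy, clear
```

The unzipped file is back at its full size on disk, so this helps with storage and transfer rather than with memory during analysis.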
>> Inside Stata you can use -encode- to change a string var to numeric with
>> value labels.
>> If the data contain many repeated strings, this can shrink the file to a
>> small fraction of its original size.
>> With -decode- you can always go back.
>>
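The -encode-/-decode- round trip might look roughly like this. The variable names city, city_id and city_back and the four observations are made up for illustration.

```stata
* Toy data: a string variable with repeated values (hypothetical)
clear
input str10 city
"Berlin"
"Paris"
"Berlin"
"Rome"
end

* Replace the strings by numeric codes that carry value labels
encode city, generate(city_id)

* -decode- recovers the original strings, so nothing is lost
decode city_id, generate(city_back)
assert city == city_back

* Keep only the numeric version and let Stata pick the smallest storage type
drop city city_back
compress
describe
```

-encode- creates a long by default, so the -compress- step matters: with few distinct values the codes usually fit in a byte.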
>> ***
>>
>> You can even output the encoded file to ASCII and restore the value labels
>> in other software with a script or a dictionary file, if the small file size
>> is worth the extra effort.
>> A few times I used Stata to create such a dictionary or script (e.g. in
>> SQL).
>>
>>
>> If all commands have the same structure (as is often the case with SQL
>> -update- or -insert- scripts),
>> you can use Stata's data window to "write" it. Some hints on how to do this:
>>
>> You must do this separately for every var you want to process in this way:
>>
>> First, -levelsof- hands the levels to a local. Do a -foreach- loop over
>> this local.
>> Inside the loop, the extended macro function -label- stores the value
>> labels created by -encode- in locals.
>> The local names should contain the level number (like "loc123") so you can
>> refer to them later.
>>
>> Now you can use -duplicates- with option "drop" to keep the unique levels
>> of this var.
>> Delete all other vars and write the fixed parts of the commands as
>> constant string vars.
>> Loop over the levels to insert the matching local values (the value label
>> strings) next to the numeric values.
>> Use -order- to put all parts of the commands into the right place.
>>
>> Copy and paste the contents of the data editor into a text editor and you
>> have a script.
>>
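The steps above can be sketched for a single variable as follows. The toy data, the SQL table city_lookup, and the names part1, part2, part3 and name are all assumptions for illustration, and labels containing quotation marks would need extra escaping.

```stata
* Toy data (hypothetical): one string variable, encoded to numeric
clear
input str10 city
"Berlin"
"Paris"
"Berlin"
"Rome"
end
encode city, generate(city_id)

* Store each value label in a local named after its level (loc1, loc2, ...)
levelsof city_id, local(levels)
foreach l of local levels {
    local loc`l' : label (city_id) `l'
}

* Keep one row per level and drop all other variables
keep city_id
duplicates drop city_id, force

* Constant parts of the command, plus a column for the label text
generate part1 = "INSERT INTO city_lookup (id, name) VALUES ("
generate part2 = ", '"
generate name = ""
generate part3 = "');"
foreach l of local levels {
    replace name = "`loc`l''" if city_id == `l'
}

* Detach the value label so the plain number is shown, then arrange columns
label values city_id .
order part1 city_id part2 name part3
list, noheader clean
```

Pasting these rows from the data editor into a text editor leaves tabs between the parts, which SQL tolerates in most places but not inside the quoted name, so the quote characters are kept in part2 and part3.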
>> Stefan
>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>

