Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Store datafile at minimum possible file size
From: Henrik Stovring <[email protected]>
To: <[email protected]>
Subject: Re: st: Store datafile at minimum possible file size
Date: Fri, 16 Apr 2010 19:14:58 +0200
Please excuse me for advertising packages written by myself, but you may
find the -zipsave- package useful, as it includes a -zipuse- and a
-zipmerge- command that make the zip files more readily accessible.
Best,
Henrik
Michael Boehm wrote:
> Thanks again, both of these suggestions sound like I could make
> profitable use of them :)
>
> Michael
>
> On Fri, Apr 16, 2010 at 2:55 PM, Pavlos C. Symeou <[email protected]> wrote:
>> Well, I just had to try this, and I surprised myself. I had an enormous
>> dataset of 14.5 GB consisting of 600 string variables and more than
>> 35,000 observations. Exporting the dataset to tab-separated format resulted
>> in a file of about 800 MB. Compressing that to Zip format resulted in a file
>> of a bit less than 18 MB. That is an amazing difference. However, the
>> problem remains, at least in my case, when the time for analysis comes: I
>> will still have to convert the compressed file back to the .dta format, and
>> then I am back at 14.5 GB. At least I can save all my files on a single
>> memory stick :)
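
The export-then-zip round trip described above might look roughly like the
following. This is only a sketch: it uses the -auto- dataset shipped with
Stata as a stand-in, the file names are made up for illustration, and it
assumes the working directory is writable.

```stata
* Stand-in data; replace with your own dataset
sysuse auto, clear

* Tab-separated export (-outsheet- writes tab-delimited text by default)
outsheet using "auto.txt", replace

* Compress the text file into a zip archive
zipfile "auto.txt", saving("auto.zip", replace)

* Later, for analysis: unpack and read the text file back in
unzipfile "auto.zip", replace
insheet using "auto.txt", clear
```

Note that variable labels and storage types do not survive the text export,
so this is not a lossless substitute for the .dta file.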
>>
>> Cheers,
>>
>> Pavlos
>>
>> On 16/04/2010 14:49, [email protected] wrote:
>>> -zipfile- has already been mentioned.
>>>
>>> Inside Stata you can use -encode- to change a string var to numeric with
>>> value labels.
>>> In case you have a lot of string repetitions in the data this can shrink
>>> the file size to a small fraction.
>>> With -decode- you can always go back.
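
A minimal sketch of the -encode- / -decode- round trip, using a made-up
string variable with many repetitions (the variable names and values are
illustrative only):

```stata
* Build a toy dataset: 1000 observations, only two distinct strings
clear
set obs 1000
generate str20 country = cond(mod(_n, 2), "Denmark", "Germany")

* Map each distinct string to an integer code with a value label
encode country, generate(country_code)

* Drop the string; the value label keeps the text once per level
drop country

* Let Stata pick the smallest storage type for the codes
compress
describe

* Go back to the string whenever it is needed
decode country_code, generate(country)
```

The saving depends on how often strings repeat; a variable whose values are
nearly all distinct gains little from this.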
>>>
>>> ***
>>>
>>> You can even output the encoded file to ASCII and restore the value labels
>>> in other software by a script or a dictionary file if the small filesize is
>>> worth the extra effort.
>>> A few times I used Stata to create such a dictionary or script (e.g. in
>>> SQL).
>>>
>>>
>>> If all commands have the same structure (as is often the case with SQL
>>> -update- or -insert- scripts),
>>> you can use Stata's data window to "write" them. Some hints on how to do this:
>>>
>>> You must do this separately for every var you want to process in this way:
>>>
>>> First, -levelsof- hands the levels to a local. Do a -foreach- loop over
>>> this local.
>>> The extended macro function -label- stores the value labels created by
>>> -encode- in locals.
>>> The local names should contain the level number (like "loc123") so you can
>>> refer to them later.
>>>
>>> Now you can use -duplicates- with the "drop" option to keep only the
>>> unique levels of this var.
>>> Delete all other vars and write the fixed parts of the commands as
>>> constant string vars.
>>> Loop over the levels to insert the matching local values (the value label
>>> strings) next to the numeric values.
>>> Use -order- to put all parts of the commands into the right place.
>>>
>>> Copy and paste the data editor to a text editor and you have a script.
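
The -levelsof- / -foreach- / -label- part of the recipe above could be
sketched as follows. Note that this variant writes the script directly with
-file write- instead of pasting from the data editor, which is a different
route to the same result. The table name, column names and output file name
are hypothetical, and labels containing apostrophes would need escaping.

```stata
* Toy data with an encoded string variable (names are illustrative)
clear
set obs 6
generate str20 country = cond(mod(_n, 2), "Denmark", "Germany")
encode country, generate(country_code)

* Hand the distinct numeric levels to a local
levelsof country_code, local(levels)

* Write one SQL statement per level to a text file
tempname fh
file open `fh' using "labels.sql", write text replace
foreach l of local levels {
    * Look up the value label text that belongs to this level
    local lab : label (country_code) `l'
    file write `fh' "UPDATE mytable SET country = '`lab'' WHERE country_code = `l';" _n
}
file close `fh'

* Inspect the generated script
type "labels.sql"
```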
>>>
>>> Stefan
>>>
>>>
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/statalist/faq
>>> * http://www.ats.ucla.edu/stat/stata/
>>>
--
Henrik Støvring          Department of Biostatistics
Associate professor      University of Aarhus
[email protected]        Bartholins Allé 2, Bldg 1261, 217
Phone +45 8942 6131      8000 Aarhus
Fax +45 8942 6140        Denmark