Re: st: Store datafile at minimum possible file size
From: Henrik Stovring <[email protected]>
To: <[email protected]>
Subject: Re: st: Store datafile at minimum possible file size
Date: Fri, 16 Apr 2010 19:32:29 +0200

Martin,
As far as I can tell, the -zipfile- command works on data files that are
already stored on disk, whereas my commands are essentially equivalents
of the -save-, -use-, and -merge- commands. In other words, if you use my
commands, you bypass the step of having the actual .dta dataset residing
on your disk, as only a .dta.zip file is created in your directory of
choice. The compression itself is done on a dataset that is
"automagically" stored in your Stata session's temporary directory (/tmp,
for example, on a Linux machine), and Stata removes this dataset without
further ado when your Stata session ends.
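
For readers who want to see the pattern outside Stata, here is a minimal Python sketch of the workflow described above: the uncompressed file exists only in a temporary directory, and only the .zip archive ends up in the target directory. This illustrates the idea and is not the -zipsave- implementation. The function names `zip_save` and `zip_use` and the use of raw bytes in place of a real .dta file are assumptions made for the example. One difference: the sketch deletes the temporary file as soon as the archive is written, rather than at the end of the session.

```python
import os
import tempfile
import zipfile


def zip_save(data: bytes, target_dir: str, name: str) -> str:
    """Write data to a temporary file, zip it into target_dir, return the zip path.

    Hypothetical helper: only the .zip archive is left in target_dir.
    """
    zip_path = os.path.join(target_dir, name + ".zip")
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, name)
        # The uncompressed copy lives only inside the temporary directory.
        with open(raw_path, "wb") as fh:
            fh.write(data)
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(raw_path, arcname=name)
    # Leaving the "with" block removed the temporary directory and its contents.
    return zip_path


def zip_use(zip_path: str) -> bytes:
    """Read the first (and, in this sketch, only) member of the archive into memory."""
    with zipfile.ZipFile(zip_path) as zf:
        first_member = zf.namelist()[0]
        return zf.read(first_member)
```

Calling `zip_save(data, some_dir, "mydata.dta")` leaves a single file, `mydata.dta.zip`, in `some_dir`, and `zip_use` on that path returns the original bytes.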
Best,
Henrik
Martin Weiss wrote:
> <>
>
> Henrik,
>
> how does your package compare to the now official -zipfile- command?
>
>
> HTH
> Martin
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Henrik Stovring
> Sent: Friday, 16 April 2010 19:15
> To: [email protected]
> Subject: Re: st: Store datafile at minimum possible file size
>
> Please excuse me for advertising packages written by myself, but you may
> find the -zipsave- package useful, as it includes a -zipuse- and a
> -zipmerge- command that make the zip files more readily accessible.
>
> Best,
>
> Henrik
>
> Michael Boehm wrote:
>> Thanks again, both of these suggestions sound like I could make
>> profitable use of them :)
>>
>> Michael
>>
>> On Fri, Apr 16, 2010 at 2:55 PM, Pavlos C. Symeou <[email protected]> wrote:
>>> Well, I just had to try this, and the result surprised me. I had an
>>> enormous dataset (14.5G) consisting of 600 string variables and more than
>>> 35000 observations. Exporting the dataset to tab-separated format resulted
>>> in a file of about 800M. Compressing it to the Zip format resulted in a file
>>> of a bit less than 18M. That is an amazing difference. However, the problem
>>> remains, at least in my case, when the time for analysis comes: I
>>> will still have to convert the compressed file back to the .dta format, and
>>> then I am back at 14.5G. At least I can save all my files on a single
>>> memory stick :)
>>>
>>> Cheers,
>>>
>>> Pavlos
>>>
>>> On 16/04/2010 14:49, [email protected] wrote:
>>>> -zipfile- has already been mentioned.
>>>>
>>>> Inside Stata you can use -encode- to convert a string variable to a numeric
>>>> variable with value labels.
>>>> If the data contain many repeated strings, this can shrink the file to a
>>>> small fraction of its original size.
>>>> With -decode- you can always go back.
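
The reason this saves space is that each repeated string is stored once in a lookup table and replaced in the data by a small integer. The following Python sketch illustrates that idea only; it is not Stata's implementation. The function names are chosen to echo the Stata commands, and assigning codes from 1 in sorted order is an assumption made for the example.

```python
def encode(values):
    """Replace each distinct string by an integer code and return the lookup table.

    Codes start at 1 and follow the sorted order of the distinct strings
    (an assumption made for this sketch).
    """
    labels = {code: text for code, text in enumerate(sorted(set(values)), start=1)}
    lookup = {text: code for code, text in labels.items()}
    codes = [lookup[v] for v in values]
    return codes, labels


def decode(codes, labels):
    """Map integer codes back to the original strings."""
    return [labels[c] for c in codes]
```

For the list `["Denmark", "Cyprus", "Denmark", "Denmark"]`, `encode` returns the codes `[2, 1, 2, 2]` together with the table `{1: "Cyprus", 2: "Denmark"}`, and `decode` restores the original list.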
>>>>
>>>> ***
>>>>
>>>> You can even output the encoded file to ASCII and restore the value labels
>>>> in other software with a script or a dictionary file, if the small file size
>>>> is worth the extra effort.
>>>> A few times I have used Stata to create such a dictionary or script (e.g. in
>>>> SQL).
>>>>
>>>>
>>>> If all commands have the same structure (as is often the case with SQL
>>>> -update- or -insert- scripts),
>>>> you can use Stata's data window to "write" the script. Some hints on how to do this:
>>>>
>>>> You must do this separately for every variable you want to process in this way:
>>>>
>>>> First, -levelsof- hands the levels to a local. Run a -foreach- loop over
>>>> this local.
>>>> The extended macro function -label- stores the value labels created by
>>>> -encode- in locals.
>>>> The local names should contain the level number (like "loc123") so you can
>>>> refer to them later.
>>>>
>>>> Now you can use -duplicates drop- to keep only the unique levels of
>>>> this variable.
>>>> Delete all other variables and write the fixed parts of the commands as
>>>> constant string variables.
>>>> Loop over the levels to insert the matching local values (value label strings)
>>>> next to the numeric values.
>>>> Use -order- to put all parts of the commands into the right place.
>>>>
>>>> Copy and paste the contents of the data editor into a text editor and you
>>>> have a script.
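
The end product of these steps is one command per level, each pairing a numeric code with its label string. The Python sketch below shows what such a generated script could look like, starting from a table that maps codes to labels. It is an illustration of the output, not the Stata procedure itself; the function name `label_script`, the table name, and the column names `code` and `label` are hypothetical.

```python
def label_script(table, labels):
    """Build one SQL INSERT statement per (code, label) pair.

    `table` and the column names `code` and `label` are hypothetical;
    `labels` maps integer codes to label strings.
    """
    lines = []
    for code in sorted(labels):
        # Double any single quotes so the label is a valid SQL string literal.
        text = labels[code].replace("'", "''")
        lines.append(f"INSERT INTO {table} (code, label) VALUES ({code}, '{text}');")
    return "\n".join(lines)
```

For example, `label_script("country_labels", {1: "Cyprus", 2: "Denmark"})` produces two INSERT statements, one for each code, in ascending order of the code.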
>>>>
>>>> Stefan
>>>>
>>>>
>>>> *
>>>> * For searches and help try:
>>>> * http://www.stata.com/help.cgi?search
>>>> * http://www.stata.com/support/statalist/faq
>>>> * http://www.ats.ucla.edu/stat/stata/
>>>>
>
--
Henrik Støvring        Department of Biostatistics
Associate professor    University of Aarhus
[email protected]      Bartholins Allé 2, Bldg 1261, 217
Phone +45 8942 6131    8000 Aarhus
Fax   +45 8942 6140    Denmark