Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Henrik Stovring <stovring@BIOSTAT.AU.DK> |
To | <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: Store datafile at minimum possible file size |
Date | Fri, 16 Apr 2010 19:14:58 +0200 |
Please excuse me for advertising packages written by myself, but you may find the -zipsave- package useful, as it includes a -zipuse- and a -zipmerge- command that make the zip-files more readily accessible.

Best,

Henrik

Michael Boehm wrote:
> Thanks again, both of these suggestions sound like I could make
> profitable use of them :)
>
> Michael
>
> On Fri, Apr 16, 2010 at 2:55 PM, Pavlos C. Symeou <p.symeou@lmu.de> wrote:
>> Well, from my experience, I just had to try this to surprise myself. I had
>> an enormous dataset 14.5G consisting of 600 string variables and more than
>> 35000 observations. Exporting the dataset to tab-separated format resulted
>> in a file of about 800M. Compressing it to the Zip format resulted in a file
>> a bit less than 18M. That is an amazing difference. However, the problem
>> always remains, at least in my case, when the time for analysis comes. I
>> will still have to convert the compressed file back to the .dta format, and
>> then get back to the 14.5G. At least I can save all my files on a single
>> memory stick:)
>>
>> Cheers,
>>
>> Pavlos
>>
>> On 16/04/2010 14:49, Stefan.Gawrich@hlpug.hessen.de wrote:
>>> -zipfile- has already been mentioned.
>>>
>>> Inside Stata you can use -encode- to change a string var to numeric with
>>> value labels.
>>> In case you have a lot of string repetitions in the data this can shrink
>>> the file size to a small fraction.
>>> With -decode- you can always go back.
>>>
>>> ***
>>>
>>> You can even output the encoded file to ASCII and restore the value labels
>>> in other software by a script or a dictionary file if the small filesize is
>>> worth the extra effort.
>>> A few times I used Stata to create such a dictionary or script (e.g. in
>>> SQL).
>>>
>>>
>>> In case that all commands have the same structure (often with SQL -update-
>>> or -insert- scripts),
>>> you can use Stata's data window to "write" it.
>>> Some hints how to do this:
>>>
>>> You must do this separately for every var you want to process in this way:
>>>
>>> First -levelsof- hands the levels to a local. Do a -foreach- loop over
>>> this local.
>>> Extended macro function -label- stores the value labels created by
>>> -encode- in locals.
>>> The local names should contain the level number (like "loc123") so you can
>>> refer to it later.
>>>
>>> Now you can use -duplicates- with option "drop" to keep unique levels of
>>> this var.
>>> Delete all other vars and write commands as constant string vars.
>>> Loop over levels to insert the fitting local values (value label strings)
>>> to the numeric values.
>>> Use -order- to put all parts of the commands into the right place.
>>>
>>> Copy and paste the data editor to a text editor and you have a script.
>>>
>>> Stefan

--
Henrik Støvring, Associate professor
Department of Biostatistics, University of Aarhus
Bartholins Allé 2, Bldg 1261, 217, 8000 Aarhus, Denmark
stovring@biostat.au.dk
Phone +45 8942 6131
Fax +45 8942 6140

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
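
The -encode-/-decode- and -zipfile- suggestions quoted above might look roughly like the following in practice. This is only a sketch under stated assumptions: the toy data, the variable names (region, region_id, region_back) and the file names (toy_small.dta, toy_small.zip) are invented for illustration and do not come from the thread. It also shows the first step of Stefan's hints, where -levelsof-, -foreach- and the extended macro function -label- put each value label into a local named after its level.

```stata
* Sketch only: toy data, variable names and file names are made up.
clear
set obs 1000

* A string variable with only three distinct values, repeated many times
generate str40 region = "North Rhine-Westphalia"
replace region = "Hesse"   if mod(_n, 3) == 1
replace region = "Bavaria" if mod(_n, 3) == 2

* -encode- maps each distinct string to an integer with a value label
encode region, generate(region_id)

* -decode- rebuilds the strings from the value labels; check the round trip
decode region_id, generate(region_back)
assert region == region_back

* Store each value label in a local named after its level (loc1, loc2, ...)
levelsof region_id, local(levels)
foreach l of local levels {
    local loc`l' : label (region_id) `l'
    display "`l' = `loc`l''"
}

* Drop the strings, let -compress- pick the smallest storage types, and save
drop region region_back
compress
save toy_small, replace

* Optionally zip the saved file from inside Stata
zipfile toy_small.dta, saving(toy_small.zip, replace)
```

How much space this saves depends on how often the strings repeat; with mostly unique strings, -encode- gains little and the value labels themselves take up room.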