RE: st: File sizes in Stata & SPSS (was Weights )
Hello all,
I just want to add some observations about encoding.
When you encode a string variable, the file still contains one copy 
of every distinct value, stored in the encoding table. Consequently, 
encoding usually saves space only if many of the values are repeated. 
If all or most observations are distinct, encoding will not save 
space. (But you may have other reasons for encoding.)
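As a rough illustration of that trade-off, here is a small Python 
sketch of the arithmetic. It is a simplified model, not Stata's actual 
file format: it assumes fixed-width string storage, a 4-byte integer 
code per observation, and an encoding table that costs one byte per 
character. The function names and the sample values are made up for 
the example.

```python
def raw_size(values, width):
    """Bytes needed to store every observation as a fixed-width string."""
    return len(values) * width


def encoded_size(values, code_bytes=4):
    """Bytes for one integer code per observation, plus one copy of
    each distinct string in the encoding table (assumed layout)."""
    table = sum(len(v) for v in set(values))
    return len(values) * code_bytes + table


# 1,000 observations, only 2 distinct values (heavily repeated)
repeated = ["Massachusetts", "Pennsylvania"] * 500
# 1,000 observations, every value distinct (each 15 characters long)
distinct = [f"respondent-{i:04d}" for i in range(1000)]

width = 15  # assumed string width, wide enough for the longest value

print(raw_size(repeated, width), encoded_size(repeated))  # 15000 4025
print(raw_size(distinct, width), encoded_size(distinct))  # 15000 19000
```

Under these assumptions, encoding cuts the repeated case to roughly a 
quarter of its size, but makes the all-distinct case larger, because 
every string is still stored once and each observation also carries a 
code.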
But even when encoding is advantageous in terms of space, there is 
one situation in which it can backfire; I had not thought of this 
until it happened to me. I had a large file with a string variable 
with many distinct values -- though many were often repeated. I 
encoded it, and gained significant space savings.
Later, I created a multitude of smaller subsets of this file. Each 
one had far fewer distinct values of the encoded variable. But each 
file retained the full encoding table -- more than it needed. (Each 
file replicated the encoding table.) The result was that each of the 
small files was much bigger than it really needed to be. (And the 
total size may have been much more than the original, even if there 
had been no overlap of observations.) Subsequently, I decoded the 
variable, and the files shrank significantly.
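To put hypothetical numbers on this effect, here is a short Python 
sketch. Every figure in it (observation counts, number of distinct 
values, string length, number of subsets) is invented for 
illustration and is not taken from the files described above. It uses 
a simplified size model, not Stata's actual file format: a 4-byte code 
per observation plus one byte per character in the encoding table.

```python
def encoded_file_size(n_obs, table_bytes, code_bytes=4):
    """Assumed model: one integer code per observation plus the table."""
    return n_obs * code_bytes + table_bytes


# Hypothetical original file
n_obs = 1_000_000        # observations
n_distinct = 50_000      # distinct string values
avg_len = 20             # assumed length of each string, in bytes

full_table = n_distinct * avg_len                       # bytes in the table
original_raw = n_obs * avg_len                          # stored as strings
original_encoded = encoded_file_size(n_obs, full_table)

# Split into 100 non-overlapping subsets of equal size
n_subsets = 100
obs_per_subset = n_obs // n_subsets

# Each subset keeps the FULL encoding table, although it needs little of it
with_full_table = n_subsets * encoded_file_size(obs_per_subset, full_table)

# Each subset decoded back to plain strings
decoded = n_subsets * obs_per_subset * avg_len

print(original_raw)      # 20000000
print(original_encoded)  # 5000000
print(with_full_table)   # 104000000
print(decoded)           # 20000000
```

In this made-up case the subsets that carry the full table total about 
five times the size of the original unencoded file, and decoding them 
brings the total back down to the unencoded size.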
I thought this was something to be aware of.
(It makes a potential case for having coding tables in a separate 
file. But there are plenty of reasons not to have it that way.)
--David
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/