Daniel Muller <[email protected]> is curious about how dataset size is
calculated by -describe-:
> I have two files: <b.dta> with 6,771,434 bytes on disk and <b1.asc>
> 3,385,856 bytes (size according to Win Commander).
>
> Stata however says:
>
> Contains data from b.dta
>   obs:     1,692,789
>  vars:             1                          5 Jul 2002 16:40
>  size:    13,542,312 (87.1% of memory free)
> -------------------------------------------------------------------
>               storage  display     value
> variable name   type   format      label      variable label
> -------------------------------------------------------------------
> b               float  %9.0g
> -------------------------------------------------------------------
> Sorted by:

The size reported by -describe- is obtained by

    1,692,789 * ( 4 + 4 ) = 13,542,312
        |         |   |
        |         |   plus 4
        |         |
    # of obs      width of data
                  1 float = 4 bytes

What is the "plus 4"? The size reported by -describe- is the size of the
memory image of the data, and in transferring the data from disk to memory,
Stata adds 4 bytes to each and every observation.

Those 4 bytes are for something called an "observation pointer". Observation
pointers are one of the things that make Stata fast.

When a dataset is written to disk, the observation pointers are not written
because they can be (and need to be) recreated each time the data is used.
That is why b.dta takes up only about half as many bytes on disk as the size
-describe- reports.
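
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python (not Stata code). The function and constant names are made up for
illustration, and the numbers are copied from the -describe- output and file
listing quoted above. The last line shows how many bytes of b.dta are left
once the raw data values are accounted for. That remainder is presumably
header and other non-data content, but that is an assumption, not something
-describe- reports.

```python
# Figures taken from the -describe- output and file listing quoted above.
N_OBS = 1_692_789              # number of observations
FLOAT_WIDTH = 4                # one float variable = 4 bytes per observation
POINTER_WIDTH = 4              # observation pointer added per observation in memory
DTA_BYTES_ON_DISK = 6_771_434  # size of b.dta as reported by the poster


def memory_size(n_obs: int, data_width: int, pointer_width: int = POINTER_WIDTH) -> int:
    """Size of the memory image: (data width + pointer width) per observation."""
    return n_obs * (data_width + pointer_width)


def raw_data_size(n_obs: int, data_width: int) -> int:
    """Bytes needed for the data values alone, with no observation pointers."""
    return n_obs * data_width


in_memory = memory_size(N_OBS, FLOAT_WIDTH)
raw = raw_data_size(N_OBS, FLOAT_WIDTH)

print(f"memory image : {in_memory:,} bytes")                # 13,542,312
print(f"raw data     : {raw:,} bytes")                      # 6,771,156
print(f"rest of file : {DTA_BYTES_ON_DISK - raw:,} bytes")  # 278
```

The first figure matches the size that -describe- reports, and the second is
within a few hundred bytes of the size of b.dta on disk.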
-- Bill
[email protected]