Bill,
Thank you for the detailed explanation. That was exactly what I needed.
Julian
--------------------------------------------------
Date: Wed, 06 Dec 2006 12:46:33 -0600
From: [email protected] (William Gould, Stata)
Subject: Re: st: Query dataset size
Julian Reif <[email protected]> writes,
> Thanks for the information. The "data + overhead" number matches what is
> returned from -describe-. However, it doesn't look like -memory- saves this
> value in r() either. What does the r(N_cur) value represent?
Julian wants to obtain "data + overhead", reported under that name by
-memory- and reported as "size" by -describe-. Call the number X. X is
defined as
        X = ( r(width) + r(size_ptr) ) * _N
              --------   -----------
                 /             \
                /               \
        from -describe-     from -memory-
I.e.,
        quietly memory
        local size_ptr = r(size_ptr)
        quietly describe
        local X = ( r(width) + `size_ptr' ) * _N
Actually, if you are using Stata/MP, there are two pointers per observation,
so the formula is X = ( r(width) + 2*r(size_ptr) ) * _N, but we'll ignore
that.
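For anyone who does not want to ignore the Stata/MP case, the calculation
can be sketched as follows. This is only a sketch: it assumes a Stata release
in which -memory- returns r(size_ptr) as described above, and it uses c(MP)
to detect Stata/MP. The local macro name n_ptr is made up for illustration.

```stata
* Sketch only: assumes -memory- returns r(size_ptr), as described above
quietly memory
local size_ptr = r(size_ptr)

* c(MP) is 1 under Stata/MP and 0 otherwise; MP keeps two pointers per obs.
* (n_ptr is a hypothetical helper name, not something Stata defines)
local n_ptr = cond(c(MP), 2, 1)

quietly describe
local X = ( r(width) + `n_ptr' * `size_ptr' ) * _N
display "data + overhead = `X' bytes"
```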
Julian also asked about r(N_cur) reported by -memory-. This will take some
explaining. You know Stata keeps the data in memory. Let's talk about
that.
The data look like this:
     a pointer
     per obs.
        \
         \  | <------- w bytes ------> |
        +---+--------------------------+
        |   |  var1  var2  ...         |
        |   |                          |
        |   |                          |  <- each line is an obs.
        |   |                          |
        |   |                          |
        |   |                          |
        +---+--------------------------+
The width (w) of an observation is just the sum of the widths of the
individual variables. For auto.dta, that width is 43 (r(width) returned
by -describe-). Thus, the data themselves require w*_N bytes. Associated
with each observation is a "pointer" -- something technical Stata needs.
The width of that pointer varies across computers. On 32-bit computers,
the pointer is 4 bytes wide. On 64-bit computers, the pointer is
8 bytes wide.
The above is the basis of the calculation we just made.
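As a worked example of that calculation, the sketch below applies both
pointer widths to auto.dta. The pointer widths are hard-coded here rather
than taken from -memory-, so the loop shows what X would be on each kind of
computer regardless of the one you are running on.

```stata
* Worked example: X = (w + size_ptr) * _N for auto.dta
sysuse auto, clear
quietly describe
local w = r(width)

* 4 = pointer width on 32-bit computers, 8 = on 64-bit computers
foreach size_ptr in 4 8 {
    display "size_ptr = `size_ptr':  X = " (`w' + `size_ptr') * _N
}
```

With a width of 43 and the 74 observations in auto.dta, this works out to
47*74 = 3,478 bytes with 4-byte pointers and 51*74 = 3,774 bytes with 8-byte
pointers.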
The data exist in a block of memory that is wider and longer than the
data themselves. This way, you can add extra variables or extra observations.
The picture looks like this:
           a pointer
           per obs.
              \
               \  | <------- w bytes ------> |
              +---+--------------------------+----------------------+
  (obs 1)     |   |  var1  var2  ...         |                      |
  (obs 2)     |   |                          |                      |
     .        |   |                          |                      |
     .        |   |                          |                      |
     .        |   |                          |                      |
  (obs _N)    |   |                          |                      |
              +---+--------------------------+                      |
  (obs _N+1)  |   |                                                 |
     .        |   |                                                 |
     .        |   |                                                 |
     .        |   |                                                 |
     .        |   |                                                 |
  (obs N_cur) |   |                                                 |
              +---+-------------------------------------------------+
                  | < --------------- w_cur bytes ----------------> |
The total number of bytes is N_cur*(size_ptr + w_cur).
To answer Julian's question, N_cur is the maximum number of observations that
can be stored GIVEN THE CURRENT PARTITIONING. That is not the same as the
maximum number of observations because Stata silently changes the current
partitioning -- holding the area constant -- when necessary. If you start
adding lots of variables, Stata will increase w_cur at the expense of N_cur.
If you instead add lots of observations, Stata will increase N_cur while
reducing w_cur.
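The trade-off can be illustrated with a little arithmetic. In the sketch
below, the area (1 MB) and the three values of w_cur are invented for
illustration, and any rounding or alignment Stata applies internally is
ignored. The only thing taken from the description above is the relationship
area = N_cur*(size_ptr + w_cur).

```stata
* Illustration only: hypothetical area and widths, not values Stata reports
local size_ptr 4            // pointer width on a 32-bit computer
local area = 1024 * 1024    // hypothetical fixed block of 1 MB

* Holding the area constant, a wider w_cur leaves room for fewer observations
foreach w_cur of numlist 50 100 200 {
    local N_cur = floor(`area' / (`size_ptr' + `w_cur'))
    display "w_cur = `w_cur' bytes  ->  N_cur = `N_cur' observations"
}
```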
Changing the partitioning sounds easy, but it is not.
Who cares?
We at StataCorp care, because we have to verify that everything is working
before we ship. So -memory- saves in r() a number of things that interest us,
and we have test scripts that put Stata through its paces and verify that
these internal values change in the way they should. If they did not, Stata
would run more slowly and, in the worst case, could actually corrupt your
data. Anyway, recorded by -memory- are things like r(n_repart), the number of
repartitioning operations performed by Stata, r(n_shift), the number of shift
operations (which I haven't described), and the characteristics of the current
state. With that information, we can design tests that move Stata to a
different state, put each test in a do-file, and use -assert- to verify that
the values before and after are just what they should be. We can also check
that Stata did not do too many (or too few) repartitionings or shifts, all of
which affect performance.
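To give a flavor of what such a check might look like, here is a minimal
sketch. It is not one of StataCorp's actual test scripts. It assumes a Stata
release in which -memory- returns r(n_repart) and r(N_cur) as described
above, and the two assertions are simply invariants that follow from those
descriptions. The variable name flag is made up.

```stata
* Sketch only: not an actual StataCorp test script
sysuse auto, clear

* record the state before the operation
quietly memory
local repart_before = r(n_repart)

* move Stata to a different state by adding a variable (hypothetical name)
generate byte flag = 0

* check the state afterward
quietly memory
assert r(n_repart) >= `repart_before'   // a count of operations cannot go down
assert r(N_cur) >= _N                   // current partition must hold the data
```

A real test would presumably assert exact values rather than inequalities,
which requires knowing precisely how the memory manager is supposed to behave.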
We don't usually talk about this, but Stata's memory manager is an important
reason Stata is so fast.