Julian Reif writes,
> Thanks for the information. The "data + overhead" number matches what is
> returned from -describe-. However, it doesn't look like -memory- saves this
> value in r() either. What does the r(N_cur) value represent?
Julian wants to obtain "data + overhead", reported under that name by
-memory- and reported as "size" by -describe-. Call the number X. X is
defined as
     X = ( r(width) + r(size_ptr) ) * _N
           -------    -----------
              /             \
             /               \
     from -describe-    from -memory-
I.e.,
        quietly memory
        local size_ptr = r(size_ptr)
        quietly describe
        local X = ( r(width) + `size_ptr' ) * _N
Actually, if you are using Stata/MP, there are two pointers per observation,
so the formula is X = ( r(width) + 2*r(size_ptr) ) * _N, but we'll ignore
that.
Julian also asked about r(N_cur) reported by -memory-. This will take some
explaining. You know Stata keeps the data in memory. Let's talk about
that.
The data look like this
   a pointer
   per obs.
        \
         \    | <------- w bytes ------> |
          +---+--------------------------+
          |   | var1 var2 ...            |
          |   |                          |
          |   |                          |  <- each line is an obs.
          |   |                          |
          |   |                          |
          |   |                          |
          +---+--------------------------+
The width (w) of an observation is just the sum of the widths of the
individual variables. For auto.dta, that width is 43 (r(width) returned
by -describe-). Thus, the data themselves require w*_N bytes. Associated
with each observation is a "pointer" -- something technical Stata needs.
The width of that pointer varies across computers. On 32-bit computers,
the pointer is 4 bytes wide. On 64-bit computers, the pointer is
8 bytes wide.
The above is the basis of the calculation we just made.
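As a quick check of that arithmetic, the numbers can be worked by hand for
auto.dta. This is only a sketch: it assumes the copy of auto.dta shipped with
Stata (74 observations, width 43) and plugs in the pointer widths given above
rather than reading them from -memory-.

```stata
* data + overhead for auto.dta, X = (w + size_ptr) * _N
* w = 43 and _N = 74 are assumed from the shipped auto.dta

* 32-bit computer, 4-byte pointer: displays 3478
display (43 + 4) * 74

* 64-bit computer, 8-byte pointer: displays 3774
display (43 + 8) * 74

* 64-bit Stata/MP, two pointers per observation: displays 4366
display (43 + 2*8) * 74
```

The data alone account for 43*74 = 3182 of those bytes; the rest is the
per-observation pointer overhead.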
The data exist in a block of memory that is wider and longer than the
data themselves. This way, you can add extra variables or extra observations.
The picture looks like this:
       a pointer
       per obs.
            \
             \    | <------- w bytes ------> |
              +---+--------------------------+----------------------+
 (obs 1)      |   | var1 var2 ...            |                      |
 (obs 2)      |   |                          |                      |
    .         |   |                          |                      |
    .         |   |                          |                      |
    .         |   |                          |                      |
 (obs _N)     |   |                          |                      |
              +---+--------------------------+                      |
 (obs _N+1)   |   |                                                 |
    .         |   |                                                 |
    .         |   |                                                 |
    .         |   |                                                 |
    .         |   |                                                 |
 (obs N_cur)  |   |                                                 |
              +---+-------------------------------------------------+
                  | < --------------- w_cur bytes ----------------> |
The total number of bytes is N_cur*(size_ptr + w_cur).
To answer Julian's question, N_cur is the maximum number of observations that
can be stored GIVEN THE CURRENT PARTITIONING. That is not the same as the
maximum number of observations because Stata silently changes the current
partitioning -- holding the area constant -- when necessary. If you start
adding lots of variables, Stata will increase w_cur at the expense of N_cur.
If you instead add lots of observations, Stata will increase N_cur while
reducing w_cur.
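The trade-off can be seen with a little arithmetic. The sketch below uses
made-up numbers (the values of size_ptr, N_cur, w_cur, and the new width are
all hypothetical) and assumes only what is stated above: the total area,
N_cur*(size_ptr + w_cur), is held constant. It says nothing about how Stata
actually chooses the new split.

```stata
* Hypothetical starting partition
local size_ptr = 8
local N_cur    = 1000
local w_cur    = 100

* Total area in bytes: displays 108000
local area = `N_cur' * (`size_ptr' + `w_cur')
display `area'

* Suppose added variables require w_cur to grow to 208 bytes.
* With the area held constant, fewer observations fit: displays 500
local w_new = 208
local N_new = floor(`area' / (`size_ptr' + `w_new'))
display `N_new'
```

Roughly doubling the width of each row halves the number of rows that fit
in the same block, which is the sense in which w_cur grows "at the expense
of" N_cur.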
Changing the partitioning sounds easy, but it is not.
Who cares?
We at StataCorp care, because we have to verify that everything is working
before we ship. So -memory- saves in r() a number of things that interest us,
and we have test scripts that put Stata through its paces and verify that
these internal values change in the way they should. If they didn't, Stata
would run more slowly and, in the worst case, could actually corrupt your
data. Anyway, recorded by -memory- are things like r(n_repart), the number of
repartitioning operations performed by Stata, r(n_shift), the number of shift
operations (which I haven't described), and the characteristics of the current
state. With that information, we can design tests that move Stata to a
different state, put each test in a do-file, and use -assert- to verify that
the values before and after are just what they should be. We can also check
that Stata did not do too many (or too few) repartitionings or shifts, all of
which affect performance.
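To give the flavor of such a test, here is a minimal sketch of a do-file. It
is not one of StataCorp's actual test scripts. The variable names extra1 to
extra50 are made up, and the checks assume that r(n_repart) is a running
count that never decreases during a session and that r(size_ptr) is fixed on
a given computer. The exact values expected before and after are not
something this post specifies, so only loose checks are shown.

```stata
sysuse auto, clear

* Record the state before the change
quietly memory
local repart_before = r(n_repart)
local size_ptr      = r(size_ptr)

* Move Stata to a different state by widening the data
forvalues i = 1/50 {
    generate double extra`i' = 0
}

* Record the state afterward and verify it
quietly memory
* assumed: the repartition count never goes down
assert r(n_repart) >= `repart_before'
* the pointer width should not change on the same computer
assert r(size_ptr) == `size_ptr'
* the current partition must have room for every observation in memory
assert r(N_cur) >= _N
```

A real test would compare against exact expected counts, so that one
repartitioning too many or too few would make the do-file stop with an error.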
We don't usually talk about this, but Stata's memory manager is an important
reason Stata is so fast.
Ada Ma asked last week why Stata uses large
(contiguous) blocks of memory rather than obtaining memory in smaller blocks
as most applications do. Ada wrote,
> My colleague told me today that an IT guy once said to her that the
> requirement of contiguous memory is unique to Stata, and there are
> statistical packages which do not need that. Before, I have always thought
> that contiguous memory is needed for all sort of computing programs and in
> the case of Stata, if there is a "contiguous memory problem", then the OS is
> to be blamed. I don't know where this guy is or else I'd have hunt him down
> to grill him some more about this comment, [...]
The answer is that we manage the memory ourselves so that we can use code
that has been optimized for the kinds of problems Stata faces. Modern
operating systems (and that includes Microsoft's new Vista) have no
difficulty delivering memory in large blocks.
-- Bill