Re: st: Query dataset size

From	[email protected] (William Gould, Stata)
To	[email protected]
Subject	Re: st: Query dataset size
Date	Wed, 06 Dec 2006 12:46:33 -0600

Julian Reif <[email protected]> writes,

> Thanks for the information.  The "data + overhead" number matches what is
> returned from -describe-.  However, it doesn't look like -memory- saves this
> value in r() either.  What does the r(N_cur) value represent?

Julian wants to obtain "data + overhead", reported under that name by 
-memory- and reported as "size" by -describe-.  Call the number X.  X is 
defined as

	X = ( r(width) + r(size_ptr) ) * _N
	      -------    -----------
                /                 \
               /                   \
        from -describe-           from -memory-

I.e., 
		quietly memory 
		local size_ptr = r(size_ptr)
		quietly describe
		local X = ( r(width) + `size_ptr' ) * _N

Actually, if you are using Stata/MP, there are two pointers per observation, 
so the formula is X = ( r(width) + 2*r(size_ptr) ) * _N, but we'll ignore 
that.
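
For anyone who does not want to ignore it, here is a minimal sketch of the
same calculation with the Stata/MP adjustment folded in.  It assumes a Stata
release in which -memory- saves r(size_ptr), as described above, and it
assumes c(MP) is 1 under Stata/MP and 0 otherwise.  The local names are
just for illustration.

```stata
* Sketch only: relies on r(size_ptr) from -memory- and r(width) from
* -describe-, both as described in this post.
quietly memory
local size_ptr = r(size_ptr)
* -describe- replaces r(), which is why size_ptr was copied first.
quietly describe
local width = r(width)

* Two pointers per observation under Stata/MP, one otherwise.
local nptr = cond(c(MP) == 1, 2, 1)
local X = (`width' + `nptr' * `size_ptr') * _N
display "data + overhead = `X' bytes"
```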

Julian also asked about r(N_cur) reported by -memory-.  This will take some
explaining.  You know Stata keeps the data in memory.  Let's talk about 
that.  

The data look like this

        a pointer 
         per obs.
             \
              \  | <------- w bytes ------> |
             +---+--------------------------+
             |   | var1 var2 ...            |    
             |   |                          |
             |   |                          |   <- each line is an obs.
             |   |                          |
             |   |                          |
             |   |                          |
             +---+--------------------------+


The width (w) of an observation is just the sum of the widths of the 
individual variables.  For auto.dta, that width is 43 (r(width) returned 
by -describe-).  Thus, the data themselves require w*_N bytes.  Associated 
with each observation is a "pointer" -- something technical Stata needs.
The width of that pointer varies across computers.  On 32-bit computers, 
the pointer is 4 bytes wide.  On 64-bit computers, the pointer is 
8 bytes wide.

The above is the basis of the calculation we just made.
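
As a worked example of that arithmetic, take auto.dta.  The width of 43
comes from the text above; the count of 74 observations is an assumption
about the shipped dataset and is not stated in this post.  Nothing below
asks Stata about memory, it is plain arithmetic.

```stata
* Worked arithmetic only.
* w = 43 is r(width) for auto.dta, as stated above.
* N = 74 is assumed: the number of observations in the shipped auto.dta.
local w = 43
local N = 74

* 4-byte pointer (32-bit computer): displays 3478
display (`w' + 4) * `N'

* 8-byte pointer (64-bit computer): displays 3774
display (`w' + 8) * `N'
```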

The data exist in a block of memory that is wider and longer than the 
data themselves.  This way, you can add extra variables or extra observations.
The picture looks like this:

        a pointer 
         per obs.
             \
              \  | <------- w bytes ------> |
             +---+--------------------------+----------------------+
   (obs 1)   |   | var1 var2 ...            |                      |
   (obs 2)   |   |                          |                      |
      .      |   |                          |                      |
      .      |   |                          |                      |
      .      |   |                          |                      |
 (obs _N)    |   |                          |                      |
             +---+--------------------------+                      |
 (obs _N+1)  |   |                                                 |
      .      |   |                                                 |
      .      |   |                                                 |
      .      |   |                                                 |
      .      |   |                                                 |
 (obs N_cur) |   |                                                 |
             +---+-------------------------------------------------+
                 | < --------------- w_cur bytes ----------------> |

The total number of bytes is N_cur*(size_ptr + w_cur).

To answer Julian's question, N_cur is the maximum number of observations that
can be stored GIVEN THE CURRENT PARTITIONING.  That is not the same as the 
maximum number of observations because Stata silently changes the current
partitioning -- holding the area constant -- when necessary.  If you start
adding lots of variables, Stata will increase w_cur at the expense of N_cur.
If you instead add lots of observations, Stata will increase N_cur while
reducing w_cur.
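
A toy calculation may make the tradeoff concrete.  The numbers below are
hypothetical, and the rule used to pick the new N_cur (divide the fixed
area by the new row size) is only an assumption for illustration; it is not
a description of what Stata's memory manager actually does.

```stata
* Toy model with hypothetical numbers; not Stata's actual algorithm.
local size_ptr = 4
local w_cur    = 100
local N_cur    = 1000

* Total bytes in the block: N_cur*(size_ptr + w_cur) = 104000
local area = `N_cur' * (`size_ptr' + `w_cur')

* Suppose added variables push the row width to 204 bytes.
* Holding the area constant, fewer rows fit.
local w_new = 204
local N_new = floor(`area' / (`size_ptr' + `w_new'))

* Displays: N_cur falls from 1000 to 500
display "N_cur falls from `N_cur' to `N_new'"
```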

Changing the partitioning sounds easy, but it is not.

Who cares?

We at StataCorp care, because we have to verify that everything is working
before we ship.  So -memory- saves in r() a number of things that interest us,
and we have test scripts that put Stata through its paces and verify that
these internal values change in the way they should.  If they didn't, Stata
would run more slowly and, in the worst case, could actually corrupt your
data.  Anyway, recorded by -memory- are things like r(n_repart), the number of
repartitioning operations performed by Stata, r(n_shift), the number of shift
operations (which I haven't described), and the characteristics of the current
state.  With that information, we can design tests that move Stata to a 
different state, and then we can put the test in a do-file, and we can 
use -assert- to verify that the values before and after are just what they 
should be.  And we can check that Stata did not do too many repartitionings
(or too few) or shifts, all of which affect performance.
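
To give the flavor of such a test, here is a rough sketch of a do-file.
It is not one of StataCorp's scripts.  It assumes -memory- saves
r(n_repart) as described above, and the two invariants it asserts (the
counter never goes down, and no observations are lost) are assumptions
chosen for illustration.  The variable names extra1-extra50 are made up.

```stata
* Sketch of a certification-style check; not StataCorp's actual script.
sysuse auto, clear

* Record the state before the operation under test.
quietly memory
local repart0 = r(n_repart)
local N0 = _N

* Add variables, which may or may not force a repartitioning.
forvalues i = 1/50 {
    quietly generate double extra`i' = 0
}

* Record the state afterward.
quietly memory
local repart1 = r(n_repart)

* Assumed invariants: the repartition count never decreases,
* and adding variables leaves the number of observations alone.
assert `repart1' >= `repart0'
assert _N == `N0'
```

A real test would presumably assert exact counts rather than an inequality,
but the exact values depend on internals not described here.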

We don't usually talk about this, but Stata's memory manager is an important
reason Stata is so fast.  

Ada Ma <[email protected]> asked last week why Stata uses large
(contiguous) blocks of memory rather than obtaining memory in smaller blocks
as most applications do.  Ada wrote, 

> My colleague told me today that an IT guy once said to her that the
> requirement of contiguous memory is unique to Stata, and there are
> statistical packages which do not need that.  Before, I have always thought
> that contiguous memory is needed for all sort of computing programs and in
> the case of Stata, if there is a "contiguous memory problem", then the OS is
> to be blamed.  I don't know where this guy is or else I'd have hunt him down
> to grill him some more about this comment, [...]

The answer is that we manage the memory ourselves so that we can use code that
has been optimized for the kinds of problems Stata faces.  Modern operating
systems (and that includes Microsoft's new Vista) have no difficulty
delivering memory in large blocks.

-- Bill
[email protected]