[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
Re: st: data set larger than RAM

From	[email protected] (Richard Gates)
To	[email protected]
Subject	Re: st: data set larger than RAM
Date	Fri, 11 Nov 2005 07:08:32 -0600
Thomas Cornelissen <[email protected]> inquired about
computing a regression on a large dataset.  Bill gave him the following
feedback:
>
> This is a follow-up on my previous question "st: data set larger than RAM" on
> handling large datasets (in the order of 10 millions of observations).
>
> Bill Gould adviced me to worry about numerical accuracy when using such large
> datasets.
>
> If I understood it right: Using the conventional regression tools may lead to
> inaccurate results in such large datasets due to problems of numerical
> precision when summing up so many numbers.
>
> I was adviced to - use quad precision, for instance quadcross() available in
> Mata - normalize the variables to mean 0 and variance 1 - use solver
> functionality instead of inverses - take much care and double-check results
>
> I have two follow-up questions about this: (1) Can I improve on this strategy
> if I replace quadcross() by doing the cross product sums manually using the
> "mean update rule" that Bill mentioned in the previous discussion? Or does
> quadcross() itself employ the mean update rule or take care of the problem by
> proceeding from the smallest to the largest number when summing up?
>
> (2) Should I also normalize dummy variables and categorial variables to mean
> 0 and variance 1 when computing X'X ? (I worry that this changes categorial
> variables from integers to real numbers and therefore increase the memory
> space needed.)
>

I propose that Thomas try using sequential accumulation QR.  This technique is
described in the Lawson and Hanson book Solving Least Squares Problems (Siam).
I quickly put together some deomstration Mata code.  It is not pollished which
I leave up to Thomas if he decides to try this approach.  Incidentally, this
type of algorithm is useful for panel data and I have used it to solve GEE's.

The first routine blkqr_init() initializes Mata global variables.  The second
routine blkqr_update() takes the X, y, and (optionally) a weight vector and
updates the QR in blocks.  On completion we perform a solveupper() on the
global matrix _R and _Qy. The global scalar _sse contains the sum of squares
error.  To get the sum of squares regression execute sum(_Qy[2::_nvar,1]:^2).
In the call to blkqr_init we specify that we do not want the QR algorithm to 
pivot the first column of X since this is the column coding the intercept.

I left out the wrapup code blkqr_final() where I envision returning the R, Qy,
rank, ssr, and sse then disposing of the globals.  If rank<_nvar then ssr =
sum(_Qy[2::rank,1]:^2) and sse = _sse + sum(_Qy[rank::_nvar,1]:^2)

mata:
void function blkqr_init(real scalar nvar, |real colvector fixed)
{
        external real matrix _R, _Qy
        external real rowvector _p, _fixed
        external real scalar _call, _nvar, _sse
                                                                                             
        /* first call                                           */
        if (nvar < 1) error(503)
        _R = J(0,0,.)
        pragma unused _Qy
        _nvar = nvar
        _p = J(1,_nvar,0)
        if (args() == 2) {
                if (any(fixed:<1||fixed:>_nvar)) error(503)
                _fixed = fixed
                _p[_fixed] = J(1,length(_fixed),1)
        }
        else {
                _fixed = J(1,0,.)
        }
        _call = 0
        /* TODO: handle more than one response variable         */
        _sse = 0
}

void function _blkqr_update(real matrix X, real matrix y,| real colvector w)
{
        real scalar m, n
        real matrix tau
        external real matrix _R, _Qy
        external real rowvector _p, _fixed
        external real scalar _call, _nvar, _sse

        m = cols(X)
        if (m != _nvar) error(503)
        n = rows(X)
                                                                                             
        if (args() == 3) {
                /* weights                                      */
                if (length(w) != n) {
                        /* TODO: informative error message      */
                        error(503)
                }
                X = X:*sqrt(w)
        }
        if (_call > 0) {
                /* update                                       */
                if (rows(y) != n) {
                        /* TODO: informative error message      */
                        error(503)
                }
                X = (_R[.,invorder(_p)]\X)
                y = (_Qy\y)
                n = n + _nvar
                _R = J(0,0,.)
        }
        _p = J(1,_nvar,0)
        if (length(_fixed)) _p[_fixed] = 1
        tau = J(0,0,.)

        _hqrdp(X, tau, _R, _p)
        _Qy = hqrdmultq(X, tau, y, 1)
        _sse = _sse + sum(_Qy[(_nvar+1)::n,1]:^2)
        _Qy = _Qy[1::_nvar,1]                                                               
        _call = _call + 1
}

/* TODO: wrapup computations
void function blkqr_final(real matrix R, real matrix Qy)
{
        /* TODO: determine the rank                             */
}
*/
end
                                                                                             
. mata
-------------- mata (type end to exit) -----------------------------------
: Z = J(0,6,.)

: z = J(0,1,.)

: 
: 
: blkqr_init(6,1)

: 
: for (i=1; i<=10; i++) {
>         X = (J(10,1,1),uniform(10,5))
>         y = invnormal(uniform(10,1))
> 
>         Z = (Z\X)
>         z = (z\y)
> 
>         _blkqr_update(X,y)
> }

: 
: solveupper(_R,_Qy)[invorder(_p)]
                  1
    +----------------+
  1 |  -.4116474784  |
  2 |   .0551046759  |
  3 |   .0731321585  |
  4 |     .67404872  |
  5 |    .508685736  |
  6 |  -.0039066072  |
    +----------------+

: _sse
  91.48053052

: sum(_Qy[2::_nvar,1]:^2)
  6.276821

: 
: b = qrsolve(Z,z)

: b
                  1
    +----------------+
  1 |  -.4116474784  |
  2 |   .0551046759  |
  3 |   .0731321585  |
  4 |     .67404872  |
  5 |    .508685736  |
  6 |  -.0039066072  |
    +----------------+

: sse = sum((z-Z*b):^2)

: sse
  91.48053052

: sum((z:-sum(z)/length(z)):^2)-sse
  6.276821

: end
---------------------------------------------------------------------

Happy computing :)

--Rich
[email protected]

 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: st: Problem with xtmixed
Next by Date: RE: st: graph bar labels
Previous by thread: Re: st: data set larger than RAM
Next by thread: st: Missing data examples/tutorials
Index(es):
- Date
- Thread