Re: st: data set larger than RAM

From   [email protected] (Richard Gates)
To   [email protected]
Subject   Re: st: data set larger than RAM
Date   Fri, 11 Nov 2005 07:08:32 -0600

Thomas Cornelissen <[email protected]> inquired about
computing a regression on a large dataset and recapped the advice Bill gave him:
> This is a follow-up on my previous question "st: data set larger than RAM" on
> handling large datasets (on the order of 10 million observations).
>
> Bill Gould advised me to worry about numerical accuracy when using such large
> datasets.  If I understood it right: using the conventional regression tools
> may lead to inaccurate results in such large datasets due to problems of
> numerical precision when summing up so many numbers.
>
> I was advised to
> - use quad precision, for instance quadcross() available in Mata
> - normalize the variables to mean 0 and variance 1
> - use solver functionality instead of inverses
> - take much care and double-check results
>
> I have two follow-up questions about this:
>
> (1) Can I improve on this strategy if I replace quadcross() by doing the
> cross product sums manually using the "mean update rule" that Bill mentioned
> in the previous discussion?  Or does quadcross() itself employ the mean
> update rule or take care of the problem by proceeding from the smallest to
> the largest number when summing up?
>
> (2) Should I also normalize dummy variables and categorical variables to mean
> 0 and variance 1 when computing X'X?  (I worry that this changes categorical
> variables from integers to real numbers and therefore increases the memory
> space needed.)

I propose that Thomas try sequential accumulation QR.  This technique is
described in Lawson and Hanson's book Solving Least Squares Problems (SIAM).
I quickly put together some demonstration Mata code.  It is not polished; I
leave that to Thomas if he decides to try this approach.  Incidentally, this
type of algorithm is also useful for panel data, and I have used it to solve
GEEs.
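
For readers who want the idea in symbols, here is a short sketch of why the
block-by-block update reproduces the full least-squares fit.  The notation
(X_j, y_j for the j-th block of rows, R_j, c_j, e_j, s_j) is introduced here
for illustration only and is not taken from the book, and column pivoting is
ignored to keep the argument short.

```latex
% Block j: stack the previous triangular factor on top of the new rows
% and re-triangularize with an orthogonal Q_j (start from s_0 = 0 and an
% empty R_0, c_0).
\begin{align*}
\begin{pmatrix} R_{j-1} \\ X_j \end{pmatrix}
  &= Q_j \begin{pmatrix} R_j \\ 0 \end{pmatrix}, &
Q_j^{T} \begin{pmatrix} c_{j-1} \\ y_j \end{pmatrix}
  &= \begin{pmatrix} c_j \\ e_j \end{pmatrix}, &
s_j &= s_{j-1} + \lVert e_j \rVert^2 .
\end{align*}
% Because Q_j is orthogonal, the cross products accumulate exactly:
\begin{align*}
R_j^{T} R_j &= R_{j-1}^{T} R_{j-1} + X_j^{T} X_j = \sum_{i \le j} X_i^{T} X_i, \\
R_j^{T} c_j &= R_{j-1}^{T} c_{j-1} + X_j^{T} y_j = \sum_{i \le j} X_i^{T} y_i .
\end{align*}
% After the last block k, with X of full column rank:
\begin{align*}
R_k b = c_k
  \;\Longrightarrow\; X^{T} X\, b = X^{T} y,
\qquad
s_k = \lVert y \rVert^2 - \lVert c_k \rVert^2 = \lVert y - X b \rVert^2 .
\end{align*}
```

Only the p x p factor, the p-vector, and one scalar are carried from block to
block, so the full X never has to be in memory, and X'X is never formed
explicitly.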

The first routine, blkqr_init(), initializes the Mata global variables.  The
second routine, _blkqr_update(), takes X, y, and (optionally) a weight vector
and updates the QR decomposition one block at a time.  On completion we call
solveupper() on the global matrices _R and _Qy to obtain the coefficients.
The global scalar _sse contains the error sum of squares.  To get the
regression sum of squares, execute sum(_Qy[2::_nvar,1]:^2).  In the call to
blkqr_init() we specify that the QR algorithm should not pivot the first
column of X, since this is the column coding the intercept.

I left out the wrap-up code, blkqr_final(), where I envision returning R, Qy,
rank, ssr, and sse and then disposing of the globals.  If rank<_nvar, then
ssr = sum(_Qy[2::rank,1]:^2) and sse = _sse + sum(_Qy[(rank+1)::_nvar,1]:^2).

void function blkqr_init(real scalar nvar, |real colvector fixed)
{
        external real matrix _R, _Qy
        external real rowvector _p, _fixed
        external real scalar _call, _nvar, _sse

        /* first call                                           */
        if (nvar < 1) error(503)
        _R = J(0,0,.)
        pragma unused _Qy
        _nvar = nvar
        _p = J(1,_nvar,0)
        if (args() == 2) {
                /* elementwise or (:|), since fixed may be a vector */
                if (any((fixed:<1) :| (fixed:>_nvar))) error(503)
                /* _fixed is declared a rowvector               */
                _fixed = rowshape(fixed, 1)
                _p[_fixed] = J(1,length(_fixed),1)
        }
        else {
                _fixed = J(1,0,.)
        }
        _call = 0
        /* TODO: handle more than one response variable         */
        _sse = 0
}

void function _blkqr_update(real matrix X, real matrix y,| real colvector w)
{
        real scalar m, n
        real matrix tau
        external real matrix _R, _Qy
        external real rowvector _p, _fixed
        external real scalar _call, _nvar, _sse

        /* note: X and y are overwritten                        */
        m = cols(X)
        if (m != _nvar) error(503)
        n = rows(X)
        if (rows(y) != n) {
                /* TODO: informative error message              */
                error(503)
        }
        if (args() == 3) {
                /* weights: scale the rows of both X and y      */
                if (length(w) != n) {
                        /* TODO: informative error message      */
                        error(503)
                }
                X = X:*sqrt(w)
                y = y:*sqrt(w)
        }
        if (_call > 0) {
                /* update: stack the new block under the old R  */
                X = (_R[.,invorder(_p)]\X)
                y = (_Qy\y)
                n = n + _nvar
                _R = J(0,0,.)
        }
        _p = J(1,_nvar,0)
        if (length(_fixed)) _p[_fixed] = J(1,length(_fixed),1)
        tau = J(0,0,.)

        _hqrdp(X, tau, _R, _p)
        _Qy = hqrdmultq(X, tau, y, 1)
        if (n > _nvar) _sse = _sse + sum(_Qy[(_nvar+1)::n,1]:^2)
        _Qy = _Qy[1::_nvar,1]
        _call = _call + 1
}

// TODO: wrapup computations
// void function blkqr_final(real matrix R, real matrix Qy)
// {
//         TODO: determine the rank
// }

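
For what it is worth, the wrap-up step might be sketched as follows.  This is
only one possible reading of the description above, not the author's code: the
argument list, the name r for the rank, and the tolerance rule used to decide
the numerical rank are all assumptions made for illustration.  It also assumes
the intercept is the fixed first column, as in the example session, and it
does not dispose of the globals.

```mata
/* A sketch only: the argument list and the tolerance rule are  */
/* assumptions, not part of the original post.                  */
void function blkqr_final(R, Qy, r, ssr, sse)
{
        external real matrix _R, _Qy
        external real scalar _nvar, _sse
        real colvector d
        real scalar tol

        /* numerical rank: count the diagonal elements of R     */
        /* that are not negligible relative to the largest one  */
        d = abs(diagonal(_R))
        tol = max(d)*_nvar*epsilon(1)
        r = sum(d :> tol)

        R = _R
        Qy = _Qy

        /* rows 2..r of Q'y give the regression sum of squares  */
        /* (row 1 is the intercept, assumed fixed in column 1)  */
        ssr = (r >= 2 ? sum(_Qy[2::r,1]:^2) : 0)

        /* rows r+1.._nvar of Q'y belong to the residual        */
        sse = _sse
        if (r < _nvar) sse = sse + sum(_Qy[(r+1)::_nvar,1]:^2)
}
```

After the last call to _blkqr_update(), it could be invoked as
blkqr_final(R=., Qy=., r=., ssr=., sse=.), with the results returned in the
five arguments.
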
. mata
-------------- mata (type end to exit) -----------------------------------
: Z = J(0,6,.)

: z = J(0,1,.)

: blkqr_init(6,1)

: for (i=1; i<=10; i++) {
>         X = (J(10,1,1),uniform(10,5))
>         y = invnormal(uniform(10,1))
>         Z = (Z\X)
>         z = (z\y)
>         _blkqr_update(X,y)
> }

: solveupper(_R,_Qy)[invorder(_p)]
  1 |  -.4116474784  |
  2 |   .0551046759  |
  3 |   .0731321585  |
  4 |     .67404872  |
  5 |    .508685736  |
  6 |  -.0039066072  |

: _sse

: sum(_Qy[2::_nvar,1]:^2)

: b = qrsolve(Z,z)

: b
  1 |  -.4116474784  |
  2 |   .0551046759  |
  3 |   .0731321585  |
  4 |     .67404872  |
  5 |    .508685736  |
  6 |  -.0039066072  |

: sse = sum((z-Z*b):^2)

: sse

: sum((z:-sum(z)/length(z)):^2)-sse

: end

Happy computing :)

[email protected]

