Thomas Cornelissen <[email protected]> inquired about
computing a regression on a large dataset. Bill gave him the following
feedback:
>
> This is a follow-up on my previous question "st: data set larger than RAM" on
> handling large datasets (in the order of 10 millions of observations).
>
> Bill Gould adviced me to worry about numerical accuracy when using such large
> datasets.
>
> If I understood it right: Using the conventional regression tools may lead to
> inaccurate results in such large datasets due to problems of numerical
> precision when summing up so many numbers.
>
> I was adviced to - use quad precision, for instance quadcross() available in
> Mata - normalize the variables to mean 0 and variance 1 - use solver
> functionality instead of inverses - take much care and double-check results
>
> I have two follow-up questions about this: (1) Can I improve on this strategy
> if I replace quadcross() by doing the cross product sums manually using the
> "mean update rule" that Bill mentioned in the previous discussion? Or does
> quadcross() itself employ the mean update rule or take care of the problem by
> proceeding from the smallest to the largest number when summing up?
>
> (2) Should I also normalize dummy variables and categorial variables to mean
> 0 and variance 1 when computing X'X ? (I worry that this changes categorial
> variables from integers to real numbers and therefore increase the memory
> space needed.)
>
I propose that Thomas try using sequential accumulation QR. This technique is
described in the Lawson and Hanson book Solving Least Squares Problems (Siam).
I quickly put together some deomstration Mata code. It is not pollished which
I leave up to Thomas if he decides to try this approach. Incidentally, this
type of algorithm is useful for panel data and I have used it to solve GEE's.
The first routine blkqr_init() initializes Mata global variables. The second
routine blkqr_update() takes the X, y, and (optionally) a weight vector and
updates the QR in blocks. On completion we perform a solveupper() on the
global matrix _R and _Qy. The global scalar _sse contains the sum of squares
error. To get the sum of squares regression execute sum(_Qy[2::_nvar,1]:^2).
In the call to blkqr_init we specify that we do not want the QR algorithm to
pivot the first column of X since this is the column coding the intercept.
I left out the wrapup code blkqr_final() where I envision returning the R, Qy,
rank, ssr, and sse then disposing of the globals. If rank<_nvar then ssr =
sum(_Qy[2::rank,1]:^2) and sse = _sse + sum(_Qy[rank::_nvar,1]:^2)
mata:
void function blkqr_init(real scalar nvar, |real colvector fixed)
{
external real matrix _R, _Qy
external real rowvector _p, _fixed
external real scalar _call, _nvar, _sse
/* first call */
if (nvar < 1) error(503)
_R = J(0,0,.)
pragma unused _Qy
_nvar = nvar
_p = J(1,_nvar,0)
if (args() == 2) {
if (any(fixed:<1||fixed:>_nvar)) error(503)
_fixed = fixed
_p[_fixed] = J(1,length(_fixed),1)
}
else {
_fixed = J(1,0,.)
}
_call = 0
/* TODO: handle more than one response variable */
_sse = 0
}
void function _blkqr_update(real matrix X, real matrix y,| real colvector w)
{
real scalar m, n
real matrix tau
external real matrix _R, _Qy
external real rowvector _p, _fixed
external real scalar _call, _nvar, _sse
m = cols(X)
if (m != _nvar) error(503)
n = rows(X)
if (args() == 3) {
/* weights */
if (length(w) != n) {
/* TODO: informative error message */
error(503)
}
X = X:*sqrt(w)
}
if (_call > 0) {
/* update */
if (rows(y) != n) {
/* TODO: informative error message */
error(503)
}
X = (_R[.,invorder(_p)]\X)
y = (_Qy\y)
n = n + _nvar
_R = J(0,0,.)
}
_p = J(1,_nvar,0)
if (length(_fixed)) _p[_fixed] = 1
tau = J(0,0,.)
_hqrdp(X, tau, _R, _p)
_Qy = hqrdmultq(X, tau, y, 1)
_sse = _sse + sum(_Qy[(_nvar+1)::n,1]:^2)
_Qy = _Qy[1::_nvar,1]
_call = _call + 1
}
/* TODO: wrapup computations
void function blkqr_final(real matrix R, real matrix Qy)
{
/* TODO: determine the rank */
}
*/
end
. mata
-------------- mata (type end to exit) -----------------------------------
: Z = J(0,6,.)
: z = J(0,1,.)
:
:
: blkqr_init(6,1)
:
: for (i=1; i<=10; i++) {
> X = (J(10,1,1),uniform(10,5))
> y = invnormal(uniform(10,1))
>
> Z = (Z\X)
> z = (z\y)
>
> _blkqr_update(X,y)
> }
:
: solveupper(_R,_Qy)[invorder(_p)]
1
+----------------+
1 | -.4116474784 |
2 | .0551046759 |
3 | .0731321585 |
4 | .67404872 |
5 | .508685736 |
6 | -.0039066072 |
+----------------+
: _sse
91.48053052
: sum(_Qy[2::_nvar,1]:^2)
6.276821
:
: b = qrsolve(Z,z)
: b
1
+----------------+
1 | -.4116474784 |
2 | .0551046759 |
3 | .0731321585 |
4 | .67404872 |
5 | .508685736 |
6 | -.0039066072 |
+----------------+
: sse = sum((z-Z*b):^2)
: sse
91.48053052
: sum((z:-sum(z)/length(z)):^2)-sse
6.276821
: end
---------------------------------------------------------------------
Happy computing :)
--Rich
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/