Thomas Cornelissen <[email protected]> asks a question that
begins,
> Suppose you had K=5000 regressors and N=10 million observations.
I have a lot to say about that opening statement. The summary will be that
this is not just a memory problem: very large datasets are not just
quantitatively different from smaller datasets, they are qualitatively
different, too. The issues break down into Numerical Issues and Substantive
Issues.
Numerical issues
----------------
I am usually pretty relaxed about numerical issues. StataCorp is not, but I,
personally, am.
Many of you learned that regression coefficients can be calculated via
b = (X'X)^(-1)X'y. They can be, but those who worry about numerical
accuracy do not use that formula.
The first inaccuracy has to do with simply forming X'X and X'y. They can be
formed more accurately if you first remove the means, and so calculate
(X-xbar)'(X-xbar) and (X-xbar)'(y-ybar). In many problems, this makes a world
of difference. One can go further, too, and normalize each of the variables
to have mean 0 *AND* variance 1. That will help numerically, but there are
arguments both ways concerning whether that is desirable. I don't want to
get into that right now, except to say that I, personally, am not a fan of
the extra normalization given how most people use statistical packages.
Stata does not do the variance=1 part. If it did, in many of the cases where
Stata drops a variable as collinear and you are happy about that, Stata would
not. On the other hand, there are a few cases where Stata drops a variable
and someone wishes it had been left in. I have never seen a real case of
that, however, meaning a real researcher working with real data. Numerical
analyst friends have concocted cases for me.
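
To make the centering concrete, here is a minimal Mata sketch of the idea;
the variable names are hypothetical, and this is an illustration of the
technique, not Stata's internal code:

        mata:
            X = st_data(., ("x1", "x2", "x3"))   // hypothetical regressors
            y = st_data(., "y")                  // hypothetical depvar
            xbar = mean(X)                       // row vector of means
            ybar = mean(y)
            XX = crossdev(X, xbar, X, xbar)      // (X-xbar)'(X-xbar)
            Xy = crossdev(X, xbar, y, ybar)      // (X-xbar)'(y-ybar)
        end

crossdev() forms the deviations-from-mean cross products without ever
making a centered copy of the data.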
The second inaccuracy has to do with forming the inverse of (X-xbar)'(X-xbar).
Stata forms (X-xbar)'(X-xbar) and (X-xbar)'(y-ybar), but then it does *NOT*
use the formula ((X-xbar)'(X-xbar))^(-1)(X-xbar)'(y-ybar) to obtain b.
Rather, it uses a "solve" approach:

        Solve  ((X-xbar)'(X-xbar))b = (X-xbar)'(y-ybar)  for b

which carries out far fewer arithmetic operations and so produces a more
accurate answer.
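
In Mata notation, the distinction is the difference between these two
lines; XX and Xy are the centered cross products from the sketch above:

        mata:
            b = invsym(XX) * Xy      // invert, then multiply: avoid
            b = cholsolve(XX, Xy)    // factor and back-substitute: prefer
        end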
The third inaccuracy has to do with the calculations for the intercept, but
never mind. Stata does that well, too.
All that said, b = (X'X)^(-1)X'y usually produces an accurate answer. It
produces an accurate answer if (1) the data are not too correlated, (2) the
means of X and y are all roughly equal, and (3) there are not too many
observations.
Okay, so Stata looks to be very careful about how it makes the
linear-regression calculation, so once we get around the memory problem,
we are set to run regressions on 10 million, or a billion, observations,
right?
No. As the number of observations grows, just calculating a sum becomes
problematic.
Stata calculates sums by addition in the usual way. If Stata needs to sum
x1, x2, x3, ..., Stata adds them. For n very, very large, that does not
work well. This is what I meant when I said very large datasets are
qualitatively, not just quantitatively, different. A new problem arises.
First, let's understand the problem. To add numbers, computers must
"normalize" them. If I want to add 1e+5 and 1e-5, I must first
write them so that the digits line up:
        10000.0000
            0.0001
        ----------
        10000.0001
That's called normalization; it's just a fancy word for something we all do
without even thinking about it. But computers have finite accuracy.
Pretend our computer had 7 digits of accuracy. That means it can record
only 7 digits, and we are left with:
        12345 67
        10000.00|
            0.00|01
        --------
        10000.00|
On this computer, 1e+4 + 1e-4 = 1e+4. The 1e-4 vanishes!
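
The same vanishing act happens with the 16-digit doubles that modern
computers use, just farther out. A quick check at the Stata prompt:

        . display %21.0f 1e+16 + 1
          10000000000000000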
Does this problem matter? A loss of 1e-4 in a number as
big as 1e+4 is not much, I admit. Now let's consider adding the
following numbers:
        1e+4 + 1e-5 + 1e-5 + 1e-5 + ... + 1e-5
               --------------------------------
               assume I have 1 million of these 1e-5 terms
The true sum is 1e+4 + (1e+6)*(1e-5) = 10,010.
What will the computer calculation yield? Let's proceed from left to
right: First, we add 1e+4 to 1e-5 to obtain 1e+4, then we add
1e-5 to obtain 1e+4, ..., and we obtain 1e+4, or 10,000. We have so
much error that adding the terms did absolutely nothing!
If we proceeded from right to left, we would get 10,010.
The order in which we add is important. In a very large dataset, the
accurate way to sum x1, x2, ..., is to order them from smallest to largest,
and then add those, left to right.
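
Here is a quick Mata demonstration of the ordering effect, using doubles
and magnitudes chosen to strain 16 digits the way 1e-5 strained 7:

        mata:
            x = 1e+16 \ J(1000000, 1, 1)        // one big value, a million 1s

            s = 0                               // left to right: big value first
            for (i=1; i<=rows(x); i++) s = s + x[i]
            printf("%21.0f\n", s)               // 10000000000000000: the 1s vanish

            s = 0                               // right to left: smallest first
            for (i=rows(x); i>=1; i--) s = s + x[i]
            printf("%21.0f\n", s)               // 10000000001000000: correct
        end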
There are other, less accurate but less computationally intensive ways
to address this problem, too. One is the mean-update rule:
If we knew the mean accurately, we could multiply by N to obtain the
sum. Let M(i) be the mean of the first i elements. Then,
        M(i+1) = M(i)*(i/(i+1)) + x_(i+1)/(i+1)
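
In Mata, the rule might be coded as follows; runmean() is a made-up name
for illustration:

        mata:
            real scalar runmean(real colvector x)
            {
                real scalar M, i

                M = 0
                for (i=1; i<=rows(x); i++) {
                    M = M*((i-1)/i) + x[i]/i    // M(i) from M(i-1) and x_i
                }
                return(M)                       // the sum is then M*rows(x)
            }
        end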
Now let's return to real life.
Stata does not use the mean-update rule in linear regression, and I know of
no package that does. It doesn't matter for the size of datasets we
usually deal with.
In terms of the formulas above, we use computers that provide 8-byte doubles,
and that amounts to approximately 16 decimal digits of accuracy. Increasing
the number of digits, as I have shown, alleviates the problem. Stata, when it
calculates sums, uses double precision and sometimes, when Stata is especially
worried about the problem (as it is with -stcox-), it uses quad precision.
Quad precision provides roughly 32 decimal digits. (Stata is unique, or
nearly unique, in its use of quad precision. I say nearly unique because I
assume some other package must use it, but I do not know of one.)
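
In Mata the two precisions sit side by side: sum() accumulates in double
precision and quadsum() in quad precision. Reusing the vector from the
summation demonstration above:

        mata:
            x = 1e+16 \ J(1000000, 1, 1)
            printf("%21.0f\n",     sum(x))    // double: the million 1s are lost
            printf("%21.0f\n", quadsum(x))    // quad:   10000000001000000
        end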
There are other numeric problems with very large datasets as well, which
I will not go into here. They may turn out to be important, but they are
usually minor compared to the summation problem.
Substantive issues
------------------
Let us assume that we somehow get a b which is an accurate representation of
(X'X)^(-1)X'y performed in infinite precision, and that we similarly
have an accurate representation of s^2(X'X)^(-1), the variance matrix, the
square root of the diagonal elements of which are the standard errors.
What do they mean? I ask you to think about statistics. The standard errors
reflect the sampling error. At 10 million observations, the sampling error
is not much.
Standard errors are theoretical lower bounds on the amount of uncertainty,
theoretical because they assume the *ONLY* source of error is sampling error,
and in particular, they ignore specification error. The uncertainty about
the specification probably now exceeds the sampling error.
At 10 million observations, we may very well have the population. In that
case, populationwise, the uncertainty is zero. Is the researcher reporting
standard errors because he or she is interested in alternative universes? I
doubt it: the researcher is still uncertain because he or she does not trust
that he or she has estimated the correct model. The standard errors that
statistics provides us do not address that issue.
Advice
------
I find these problems fascinating, both numerically and substantively. I also
believe that these are problems of growing importance. My advice to anyone
interested in breaking into the statistical software industry is to carve out
these problems and start working on them, substantively and numerically.
Produce software for supercomputers.
Thomas probably does not want to do this, however.
So Thomas is in a difficult position. I offer the following advice:
1. Whatever software you use -- Stata, SAS, ... -- do *NOT* assume
you are getting correct answers on anything. You must verify
as best you can.
2. Draw a reasonable-sized (but large) subsample from the data.
Perform all analyses on that, and compare results to those
obtained from the full dataset.
3. Make sure all variables have reasonable and roughly equal means
*AND* standard deviations; see the sketch after this list. This is
especially important if you are not using Stata.
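
For (3), something along these lines would do; x1 stands in for each of
your variables:

        . summarize x1
        . generate double z1 = (x1 - r(mean)) / r(sd)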
I am *NOT* saying that these problems I have outlined are going to affect
Thomas with his 10-million observation, 5,000-variable dataset. I do not
know; it all depends, and it depends on too many things to outline
here. One could write a book. All I can say is that Thomas is standing
near the precipice. In these kinds of problems, we throw around powers
of 2 and 10. For Thomas' data, the problem may not strike until 100 million
observations, or a billion, but Thomas needs to be cautious.
That is why I suggest (2), and I suggest (2) even more strongly if Thomas
is going to make any of his own calculations.
Stata provides the tools Thomas needs for the full calculation.
I said to normalize variables to have mean 0, variance 1. Thomas can
trust the means and standard deviations reported by -summarize-.
-summarize- is one of Stata's routines that uses quad precision.
Thomas should learn about Mata. For calculating X'X and X'y, Mata provides
cross() and quadcross(), the latter being quad precision. I believe that
quad-precision routines are the solution to most of Thomas' problems. Thomas
should learn about Mata's solvers so that he can efficiently solve for b
rather than basing results on inverses. On this, too, I can offer
reassurance: there are no better routines available, period, than the
LAPACK-based routines that Mata provides.
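
Putting the pieces together, the full-dataset calculation might look like
the sketch below. The variable names are hypothetical; I show cholsolve(),
and qrsolve() is a reasonable alternative if collinearity is a worry:

        mata:
            X = st_data(., ("x1", "x2", "x3"))   // in practice, all 5,000 regressors
            y = st_data(., "y")

            XX = quadcross(X, 1, X, 1)   // quad-precision (X,1)'(X,1); the 1s
            Xy = quadcross(X, 1, y, 0)   //   append a constant as the last column

            b  = cholsolve(XX, Xy)       // solve for b; intercept is b's last element
        end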
Good luck.
-- Bill
[email protected]