Giovanni Vecchi <[email protected]> observed,
> I would appreciate your comments on the following:
>
> . use auto.dta
>
> . regress price mpg
> (output omitted)
>
> . predict resid, res
>
> . egen sumres=sum(resid)
>
> . sum sumres
>
> Variable | Obs Mean Std. Dev. Min Max
> -------------+-----------------------------------------------------
> sumres | 74 -.0004654 0 -.0004654 -.0004654
>
> The sum of residuals (which should be zero according to the theory) is
> -.0004654. This estimate looks "high" to me. I ran the same code in Gauss
> and obtained a much smaller estimate (something like 9*10^-10).
As Scott Merryman <[email protected]> has already observed,
> The variable type created by -predict- is float by default. If you specify
> double, you will get much higher precision.
>
> . use "C:\Stata\auto.dta", clear
> (1978 Automobile Data)
>
> . qui reg price mpg
>
> . predict double res, res
>
> . sum res
>
> Variable | Obs Mean Std. Dev. Min Max
> -------------+-----------------------------------------------------
> res | 74 -2.58e-13 2605.621 -3184.174 9669.721
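An aside, not part of Scott's post: -summarize- also leaves the whole-sample
sum behind in r(sum), so the residual sum itself can be displayed at full
precision without creating a new variable via -egen-:
. quietly summarize res
. display %20.0g r(sum)
The value shown is just N times the reported mean, roughly 74 * -2.58e-13 =
-1.9e-11.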
I now wish to go further than Scott and show that, given coefficient
estimates recorded in double precision, the mean of -2.58e-13 is as small as
can be obtained. I do this not because it is important but merely because we
are very proud of the accuracy of the Stata code.
The problem with looking at residuals is that they are the result of
subtraction and, numerically speaking, subtracting nearly equal quantities
magnifies whatever roundoff error is already present.
An implication of the residuals summing to zero is that the mean of the
predicted values should equal the mean of the original values. The wonderful
thing about the test stated in these terms is that it avoids subtraction
altogether. So let's make that calculation:
. predict double hat
(option xb assumed; fitted values)
. sum price
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
price | 74 6165.257 2949.496 3291 15906
. scalar true = r(mean)
. sum hat
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
hat | 74 6165.257 1382.124 1458.392 8386.329
. display r(mean)-true
0
And there it is: the result is exactly 0. There is not one detectable bit of
inaccuracy at double precision. That is a very neat result. I hasten to add
that the result is also not of great importance, numerically speaking, but we
are proud of it.
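For anyone who wants to verify the equality bit by bit (my addition, not part
of the original exchange), Stata's %21x display format shows every bit of a
double in hexadecimal:
. quietly summarize price
. scalar mprice = r(mean)
. quietly summarize hat
. scalar mhat = r(mean)
. display %21x mprice
. display %21x mhat
If the two hexadecimal strings are identical, the two means agree in all 64
bits, which is what the displayed difference of 0 above asserts.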
So how is it that, if the means are exactly equal, the sum of the residuals
is not also exactly zero? Writing y1 for price and y2 for hat, the former
implies the latter:
      Sum(y1)/N - Sum(y2)/N = 0
   => Sum(y1/N - y2/N)      = 0
   => Sum((y1 - y2)/N)      = 0
   => Sum(y1 - y2)          = 0
The answer has to do with the calculation of the (y1-y2) term. Whenever
computers calculate a difference, they lose precision. That Giovanni,
performing a float-precision calculation, obtained a sum of -.0004654, and
that Scott, performing a double-precision calculation, obtained a sum of
-2.58e-13, are nothing more than byproducts of the inaccuracy of digital
computers in making difference calculations. Both calculations amount to
summing roundoff error.
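To see the cancellation in miniature, here is an illustration of my own, not
part of the original thread. In double precision, 0.1 + 0.2 and 0.3 are not
the same number, so their difference, which is exactly zero in mathematics,
is pure roundoff:
. display (0.1 + 0.2) - 0.3
. display %21x (0.1 + 0.2)
. display %21x 0.3
The first line displays a value on the order of 1e-17 rather than 0, and the
two %21x lines differ in their last hexadecimal digit. Summing 74 residuals
amounts to summing 74 such errors, which is why the total is tiny but not
exactly zero.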
-- Bill
[email protected]