<reposting - first time didn't get through>
Dear Statalisters:
I've run into a numerical accuracy/collinearity issue that I think might
be of interest. It relates specifically to a built-in Stata command,
-ovtest-, but I think it raises general issues.
-ovtest- implements a version of the Ramsey RESET (sometimes called an
"omitted variables test"). The textbook description of this particular
version of the test is as follows:
1. Estimate the equation using -regress-.
2. Calculate the predicted values of the dependent variable, yhat.
3. Create new variables yhat2, yhat3 and yhat4, equal to yhat^2, yhat^3 and yhat^4.
4. Re-estimate the original equation, including yhat2, yhat3 and yhat4
as regressors.
5. Test yhat2, yhat3 and yhat4 for joint significance using an F test.
A large test statistic in (5) is evidence that the original equation is
misspecified.
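For concreteness, here is a minimal sketch of steps 1-5 as I read them, using the same dataset and variables (lw, s) as in the example further down. The -forvalues- loop is just my shorthand for step 3; this is the textbook recipe, not necessarily what -ovtest- does internally.

```stata
* Textbook RESET, steps 1-5 (a sketch, no rescaling)
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear

* Step 1: estimate the original equation
regress lw s

* Step 2: predicted values of the dependent variable
predict double yhat

* Step 3: powers of the predicted values (yhat2, yhat3, yhat4)
forvalues p = 2/4 {
    gen double yhat`p' = yhat^`p'
}

* Step 4: re-estimate with the powers added as regressors
regress lw s yhat2 yhat3 yhat4

* Step 5: joint F test of the added powers
testparm yhat2 yhat3 yhat4
```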
In fact, implementing the test exactly as above does not always generate
output that matches that of -ovtest-. What sometimes happens is that
yhat2, yhat3 and yhat4 are nearly collinear with the other regressors in
the step (4) regression, and -regress- drops one of them.
What Stata's -ovtest- does to avoid this is to rescale yhat so that it
lies in the unit interval. Call this step 2a:
2a. sum yhat, meanonly; replace yhat = (yhat-r(min))/(r(max)-r(min))
In practice, this seems to eliminate the collinearities.
What is curious is that the following alternative rescaling usually does
*not* eliminate the collinearities: first calculate yhat^2, yhat^3
and yhat^4, and *then* rescale these so that they lie in the unit
interval. Call this step 3a:
3a. sum yhat2, meanonly; replace yhat2 = (yhat2-r(min))/(r(max)-r(min))
    sum yhat3, meanonly; replace yhat3 = (yhat3-r(min))/(r(max)-r(min))
    sum yhat4, meanonly; replace yhat4 = (yhat4-r(min))/(r(max)-r(min))
Below is an example.
Using steps 1-5 with no rescaling generates a collinearity, and -regress-
drops a variable in step 4. -coldiag2- shows the condition number for
the step 4 regression is huge: 7,454,604
Using steps 1-5 plus 3a also generates a collinearity, and -regress-
drops a variable in step 4. -coldiag2- again shows the condition number
for the step 4 regression is huge, though somewhat smaller: 1,658,268
Using steps 1-5 plus 2a, which is Stata's -ovtest- procedure, does not
generate a collinearity, and in step 4 -regress- drops nothing.
-coldiag2- shows the condition number for the step 4 regression is
much smaller, but still way above the rule of thumb that ">30 means
collinearity problems": 538
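One quick diagnostic for the three cases above, offered as a sketch rather than as anything -ovtest- itself does: compare the pairwise correlations among the regressors. Correlations are unchanged by a positive linear rescaling of each variable, so the matrices for the unrescaled and the 3a variables should agree up to rounding, while the 2a variables can differ. Pairwise correlations are only a partial view of collinearity, and the loop below is my own shorthand for the variable construction in the example.

```stata
* Sketch: pairwise correlations under the three constructions
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
qui regress lw s
qui predict double yhat

* Step 2a: rescale yhat to the unit interval
sum yhat, meanonly
gen double yhatr = (yhat - r(min))/(r(max) - r(min))

forvalues p = 2/4 {
    * unrescaled power
    gen double yhat`p' = yhat^`p'
    * rescaled first, then powered (2a)
    gen double yhatr`p' = yhatr^`p'
    * powered first, then rescaled (3a)
    sum yhat`p', meanonly
    gen double yhat`p'r = (yhat`p' - r(min))/(r(max) - r(min))
}

correlate s yhat2 yhat3 yhat4
correlate s yhat2r yhat3r yhat4r
correlate s yhatr2 yhatr3 yhatr4
```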
My first question - why does the Stata method "work"?
My second question - *does* the Stata method work? Or does rescaling
followed by raising to the 2nd, 3rd and 4th power introduce numerical
inaccuracies that cause what is a "genuine" near-collinearity to
decrease so much that Stata's -regress- doesn't detect it?
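One possible cross-check on the second question, sketched here under an assumption specific to this example: the model has a single regressor s, so yhat is an exact linear function of s, and in exact arithmetic the augmented regressions with and without rescaling all span the same space as a constant plus s, s^2, s^3 and s^4. Orthogonal polynomials in s built with -orthpoly- give a better-conditioned basis for that space, so the joint test of the higher-order terms should be comparable with the RESET results. The names p1-p4 are my own, and this does not carry over to models with several regressors.

```stata
* Sketch: RESET-equivalent test via orthogonal polynomials in s
* (valid only because s is the sole regressor in this example)
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear

* p1-p4 are hypothetical names for the degree 1-4 orthogonal polynomials
orthpoly s, generate(p1 p2 p3 p4) degree(4)

* p1 is linear in s, so p2-p4 play the role of the added powers of yhat
regress lw p1 p2 p3 p4
testparm p2 p3 p4
```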
Any ideas? It's not because I'm using floats. Doubles everywhere.
--Mark
***************** Example output ****************
. version 8.2
. version
version 8.2
. which ovtest
C:\Stata8\ado\base\o\ovtest.ado
*! version 2.3.6 05sep2001
. which coldiag2
c:\ado8\plus\c\coldiag2.ado
*! version 2.0, 01Dec2004, [email protected]
.
. use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
(Wages of Very Young Men, Zvi Griliches, J.Pol.Ec. 1976)
.
. * Generate yhats
. qui regress lw s
. qui predict double yhat
. * yhatr=rescaled yhat
. sum yhat, meanonly
. qui gen double yhatr = (yhat-r(min))/(r(max)-r(min))
. qui gen double yhat2=yhat^2
. qui gen double yhat3=yhat^3
. qui gen double yhat4=yhat^4
. * yhatr2=rescaled, then ^2; similarly for yhatr3 and yhatr4
. qui gen double yhatr2=yhatr^2
. qui gen double yhatr3=yhatr^3
. qui gen double yhatr4=yhatr^4
. * yhat2r=yhat^2 and then rescaled; similarly for yhat3r and yhat4r
. sum yhat2, meanonly
. qui gen double yhat2r = (yhat2-r(min))/(r(max)-r(min))
. sum yhat3, meanonly
. qui gen double yhat3r = (yhat3-r(min))/(r(max)-r(min))
. sum yhat4, meanonly
. qui gen double yhat4r = (yhat4-r(min))/(r(max)-r(min))
. * Summarize variables
. sum lw s yhat*
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
lw | 758 5.686739 .4289494 4.605 7.051
s | 758 13.40501 2.231828 9 18
yhat | 758 5.686739 .2156493 5.261107 6.130727
yhatr | 758 .4894459 .2479809 0 1
yhat2 | 758 32.38544 2.473995 27.67924 37.58581
-------------+--------------------------------------------------------
yhat3 | 758 184.7003 21.3139 145.6234 230.4284
yhat4 | 758 1054.929 163.4247 766.1405 1412.693
yhatr2 | 758 .3009707 .2769761 0 1
yhatr3 | 758 .2142687 .2681644 0 1
yhatr4 | 758 .1671979 .2530434 0 1
-------------+--------------------------------------------------------
yhat2r | 758 .4750582 .2497327 0 1
yhat3r | 758 .4607848 .2513286 0 1
yhat4r | 758 .4466593 .252763 0 1
.
. * Quadratic form of RESET
. * 1. Unrescaled RESET
. * Collinearity appears
. qui regress lw s yhat2 yhat3 yhat4
. testparm yhat2 yhat3 yhat4
( 1) yhat2 = 0
( 2) yhat3 = 0
( 3) yhat4 = 0
Constraint 1 dropped
F( 2, 754) = 0.87
Prob > F = 0.4191
. * 2. yhat that is first ^2, ^3, ^4, then rescaled
. * Collinearity appears
. qui regress lw s yhat2r yhat3r yhat4r
. testparm yhat2r yhat3r yhat4r
( 1) yhat2r = 0
( 2) yhat3r = 0
( 3) yhat4r = 0
Constraint 1 dropped
F( 2, 754) = 0.87
Prob > F = 0.4191
. * 3. yhat that is first rescaled, then ^2, ^3, ^4
. * No collinearity
. qui regress lw s yhatr2 yhatr3 yhatr4
. testparm yhatr2 yhatr3 yhatr4
( 1) yhatr2 = 0
( 2) yhatr3 = 0
( 3) yhatr4 = 0
F( 3, 753) = 0.59
Prob > F = 0.6216
. * 4. Stata's built-in ovtest
. * Matches first-rescaled-then-powered, i.e., (3)
. qui regress lw s
. ovtest
Ramsey RESET test using powers of the fitted values of lw
Ho: model has no omitted variables
F(3, 753) = 0.59
Prob > F = 0.6216
.
. * Collinearities
. _rmcoll s yhat2 yhat3 yhat4
note: yhat2 dropped due to collinearity
. _rmcoll s yhat2r yhat3r yhat4r
note: yhat2r dropped due to collinearity
. _rmcoll s yhatr2 yhatr3 yhatr4
.
. * -coldiag2-
. coldiag2 s yhat2 yhat3 yhat4
Condition number using scaled variables = 7454604.11
Condition Indexes and Variance-Decomposition Proportions
condition
index _cons s yhat2 yhat3 yhat4
1 1.00 0.00 0.00 0.00 0.00 0.00
2 16.73 0.00 0.00 0.00 0.00 0.00
3 359.37 0.00 0.00 0.00 0.00 0.00
4 33354.15 0.00 0.00 0.00 0.00 0.00
5 7454604.11 1.00 1.00 1.00 1.00 1.00
. coldiag2 s yhat2r yhat3r yhat4r
Condition number using scaled variables = 1658268.08
Condition Indexes and Variance-Decomposition Proportions
condition
index _cons s yhat2r yhat3r yhat4r
1 1.00 0.00 0.00 0.00 0.00 0.00
2 4.69 0.00 0.00 0.00 0.00 0.00
3 168.37 0.00 0.00 0.00 0.00 0.00
4 12145.95 0.00 0.00 0.00 0.00 0.00
5 1658268.08 1.00 1.00 1.00 1.00 1.00
. coldiag2 s yhatr2 yhatr3 yhatr4
Condition number using scaled variables = 538.15
Condition Indexes and Variance-Decomposition Proportions
condition
index _cons s yhatr2 yhatr3 yhatr4
1 1.00 0.00 0.00 0.00 0.00 0.00
2 2.38 0.00 0.00 0.00 0.00 0.00
3 14.57 0.00 0.00 0.00 0.00 0.00
4 96.54 0.08 0.06 0.00 0.02 0.05
5 538.15 0.92 0.94 1.00 0.98 0.94
*********** do file to generate output **************
version 8.2
version
which ovtest
which coldiag2
use http://fmwww.bc.edu/ec-p/data/hayashi/griliches76.dta, clear
* Generate yhats
qui regress lw s
qui predict double yhat
* yhatr=rescaled yhat
sum yhat, meanonly
qui gen double yhatr = (yhat-r(min))/(r(max)-r(min))
qui gen double yhat2=yhat^2
qui gen double yhat3=yhat^3
qui gen double yhat4=yhat^4
* yhatr2=rescaled, then ^2; similarly for yhatr3 and yhatr4
qui gen double yhatr2=yhatr^2
qui gen double yhatr3=yhatr^3
qui gen double yhatr4=yhatr^4
* yhat2r=yhat^2 and then rescaled; similarly for yhat3r and yhat4r
sum yhat2, meanonly
qui gen double yhat2r = (yhat2-r(min))/(r(max)-r(min))
sum yhat3, meanonly
qui gen double yhat3r = (yhat3-r(min))/(r(max)-r(min))
sum yhat4, meanonly
qui gen double yhat4r = (yhat4-r(min))/(r(max)-r(min))
* Summarize variables
sum lw s yhat*
* Quadratic form of RESET
* 1. Unrescaled RESET
* Collinearity appears
qui regress lw s yhat2 yhat3 yhat4
testparm yhat2 yhat3 yhat4
* 2. yhat that is first ^2, ^3, ^4, then rescaled
* Collinearity appears
qui regress lw s yhat2r yhat3r yhat4r
testparm yhat2r yhat3r yhat4r
* 3. yhat that is first rescaled, then ^2, ^3, ^4
* No collinearity
qui regress lw s yhatr2 yhatr3 yhatr4
testparm yhatr2 yhatr3 yhatr4
* 4. Stata's built-in ovtest
* Matches first-rescaled-then-powered, i.e., (3)
qui regress lw s
ovtest
* Collinearities
_rmcoll s yhat2 yhat3 yhat4
_rmcoll s yhat2r yhat3r yhat4r
_rmcoll s yhatr2 yhatr3 yhatr4
* -coldiag2-
coldiag2 s yhat2 yhat3 yhat4
coldiag2 s yhat2r yhat3r yhat4r
coldiag2 s yhatr2 yhatr3 yhatr4
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/