How does the anova command handle collinearity?
Title   The anova command and collinearity
Author  William Sribney, StataCorp
Here is an example that illustrates what happens.
. input woman twin
woman twin
1. 1 1
2. 2 1
3. 3 2
4. 4 2
5. 5 3
6. 6 3
7. end
. tab woman, gen(w)

      woman |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          2 |          1       16.67       33.33
          3 |          1       16.67       50.00
          4 |          1       16.67       66.67
          5 |          1       16.67       83.33
          6 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00
. tab twin, gen(t)

       twin |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          2       33.33       66.67
          3 |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
. gen t1w1 = t1*w1 - t1*w2
. gen t2w3 = t2*w3 - t2*w4
. gen t3w5 = t3*w5 - t3*w6
. list w* t*, nodisplay sep(0)

     +--------------------------------------------------------------------+
     | woman  w1  w2  w3  w4  w5  w6  twin  t1  t2  t3  t1w1  t2w3  t3w5 |
     |--------------------------------------------------------------------|
  1. |     1   1   0   0   0   0   0     1   1   0   0     1     0     0 |
  2. |     2   0   1   0   0   0   0     1   1   0   0    -1     0     0 |
  3. |     3   0   0   1   0   0   0     2   0   1   0     0     1     0 |
  4. |     4   0   0   0   1   0   0     2   0   1   0     0    -1     0 |
  5. |     5   0   0   0   0   1   0     3   0   0   1     0     0     1 |
  6. |     6   0   0   0   0   0   1     3   0   0   1     0     0    -1 |
     +--------------------------------------------------------------------+
. set seed 123
. gen x = 12 - int(2*runiform())
. expand x
(63 observations created)
. gen y = runiform()
. anova y woman twin

                       Number of obs =      69     R-squared     =  0.0251
                       Root MSE      = .304633     Adj R-squared = -0.0523

                Source | Partial SS    df       MS          F     Prob > F
            -----------+---------------------------------------------------
                 Model |  .15054776     5   .03010955       0.32     0.8964
                       |
                 woman |  .15054776     5   .03010955       0.32     0.8964
                  twin |          0     0
                       |
              Residual |  5.8464905    63   .09280144
            -----------+---------------------------------------------------
                 Total |  5.9970382    68   .08819174
. regress y w1-w5 t1-t3
note: w2 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          w1 |   .0516343   .1271611     0.41   0.686    -.2024769    .3057455
          w2 |          0  (omitted)
          w3 |          0  (omitted)
          w4 |   .0359635   .1298961     0.28   0.783    -.2236131      .29554
          w5 |   .0800831    .124366     0.64   0.522    -.1684426    .3286087
          t1 |          0  (omitted)
          t2 |  -.0798703   .1298961    -0.61   0.541    -.3394469    .1797063
          t3 |  -.0642359   .1271611    -0.51   0.615    -.3183471    .1898752
       _cons |   .5206881   .0918504     5.67   0.000     .3371397    .7042364
------------------------------------------------------------------------------
The regress model is obviously collinear, but so was the anova model. The
anova command
keeps terms from left to right. Hence, it “omitted” the twin
effect (i.e., all the twin dummies).
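The zero df for twin is a rank statement: the twin dummies lie in the column space of the woman dummies. A minimal numpy sketch (not part of the Stata session; it just rebuilds the same design matrix by hand):

```python
import numpy as np

# Rebuild the dummies from the listing above: 6 observations,
# woman 1-6, twins paired as (1,2), (3,4), (5,6).
woman = np.arange(1, 7)
twin = np.array([1, 1, 2, 2, 3, 3])

W = (woman[:, None] == np.arange(1, 7)).astype(float)  # w1..w6
T = (twin[:, None] == np.arange(1, 4)).astype(float)   # t1..t3

# Each twin dummy is the sum of two woman dummies (t1 = w1 + w2, etc.),
# so appending T adds nothing to the column space:
rank_W = np.linalg.matrix_rank(W)
rank_WT = np.linalg.matrix_rank(np.hstack([W, T]))
print(rank_W, rank_WT)  # 6 6 -> twin contributes 0 df after woman
```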
. anova y twin woman

                       Number of obs =      69     R-squared     =  0.0251
                       Root MSE      = .304633     Adj R-squared = -0.0523

                Source | Partial SS    df       MS          F     Prob > F
            -----------+---------------------------------------------------
                 Model |  .15054776     5   .03010955       0.32     0.8964
                       |
                  twin |  .09122261     2   .04561131       0.49     0.6140
                 woman |  .06089443     3   .02029814       0.22     0.8831
                       |
              Residual |  5.8464905    63   .09280144
            -----------+---------------------------------------------------
                 Total |  5.9970382    68   .08819174
Again, anova keeps terms from left to right; here it kept only three of the six woman dummies.
. anova y twin twin#woman

                       Number of obs =      69     R-squared     =  0.0251
                       Root MSE      = .304633     Adj R-squared = -0.0523

                Source | Partial SS    df       MS          F     Prob > F
            -----------+---------------------------------------------------
                 Model |  .15054776     5   .03010955       0.32     0.8964
                       |
                  twin |   .0872024     2    .0436012       0.47     0.6273
            twin#woman |  .06089443     3   .02029814       0.22     0.8831
                       |
              Residual |  5.8464905    63   .09280144
            -----------+---------------------------------------------------
                 Total |  5.9970382    68   .08819174
Below, we do the equivalent regression.
. regress y t1 t2 t1w1 t2w3 t3w5

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0500115   .0889338     0.56   0.576    -.1277084    .2277315
          t2 |  -.0376941   .0899165    -0.42   0.676    -.2173779    .1419896
        t1w1 |   .0258171   .0635806     0.41   0.686    -.1012385    .1528727
        t2w3 |  -.0179817    .064948    -0.28   0.783      -.14777    .1118066
        t3w5 |   .0400415    .062183     0.64   0.522    -.0842213    .1643044
       _cons |   .4964937    .062183     7.98   0.000     .3722309    .6207565
------------------------------------------------------------------------------
. test t1 t2
( 1) t1 = 0
( 2) t2 = 0
F( 2, 63) = 0.47
Prob > F = 0.6273
I made the interactions orthogonal, which is essentially what
anova does.
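That orthogonality is easy to verify on the six base observations (before expand); a quick numpy check, with the columns copied from the listing above:

```python
import numpy as np

cons = np.ones(6)
t1   = np.array([1., 1, 0, 0, 0, 0])
t2   = np.array([0., 0, 1, 1, 0, 0])
t3   = np.array([0., 0, 0, 0, 1, 1])
t1w1 = np.array([1., -1, 0, 0, 0, 0])
t2w3 = np.array([0., 0, 1, -1, 0, 0])
t3w5 = np.array([0., 0, 0, 0, 1, -1])

# With balanced data, every interaction column is orthogonal to the
# constant and to every main-effect column (and to the other interactions):
main = [cons, t1, t2, t3]
inter = [t1w1, t2w3, t3w5]
print(all(m @ v == 0 for m in main for v in inter))  # True
```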
. test t1w1 t2w3 t3w5
( 1) t1w1 = 0
( 2) t2w3 = 0
( 3) t3w5 = 0
F( 3, 63) = 0.22
Prob > F = 0.8831
Hopefully, you understand the above Wald tests; the anova partial SS and their tests are equivalent to them. I call them “added-last” tests. The test of t1 = t2 = 0 is a test of

    y = t1w1 t2w3 t3w5 t1 t2

vs.

    y = t1w1 t2w3 t3w5
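That comparison is an ordinary extra-sum-of-squares F test between two nested models. A sketch in numpy with hypothetical data (Stata's runiform() stream is not reproduced here, so y, and hence the F, will not match the run above):

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical data, not Stata's seed 123

# Base design: 3 twin pairs, main effects t1 t2, and the orthogonal
# interactions t1w1 t2w3 t3w5 constructed above.
cons = np.ones(6)
t1   = np.array([1., 1, 0, 0, 0, 0])
t2   = np.array([0., 0, 1, 1, 0, 0])
t1w1 = np.array([1., -1, 0, 0, 0, 0])
t2w3 = np.array([0., 0, 1, -1, 0, 0])
t3w5 = np.array([0., 0, 0, 0, 1, -1])

reps = rng.integers(11, 13, size=6)     # mimic expand: 11-12 copies each

def rep(c):
    return np.repeat(c, reps)

y = rng.random(reps.sum())

X_full = np.column_stack([rep(c) for c in (cons, t1, t2, t1w1, t2w3, t3w5)])
X_red = X_full[:, [0, 3, 4, 5]]         # drop t1, t2 (the tested terms)

def rss(X):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r

q = 2                                   # number of restrictions
df_resid = len(y) - X_full.shape[1]
F = ((rss(X_red) - rss(X_full)) / q) / (rss(X_full) / df_resid)
print(round(F, 4))
```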
The following explains sequential SS:
. anova y twin twin#woman, seq

                       Number of obs =      69     R-squared     =  0.0251
                       Root MSE      = .304633     Adj R-squared = -0.0523

                Source |    Seq. SS    df       MS          F     Prob > F
            -----------+---------------------------------------------------
                 Model |  .15054776     5   .03010955       0.32     0.8964
                       |
                  twin |  .08965333     2   .04482667       0.48     0.6192
            twin#woman |  .06089443     3   .02029814       0.22     0.8831
                       |
              Residual |  5.8464905    63   .09280144
            -----------+---------------------------------------------------
                 Total |  5.9970382    68   .08819174
. anova y twin

                       Number of obs =      69     R-squared     =  0.0149
                       Root MSE      = .299175     Adj R-squared = -0.0149

                Source | Partial SS    df       MS          F     Prob > F
            -----------+---------------------------------------------------
                 Model |  .08965333     2   .04482667       0.50     0.6083
                       |
                  twin |  .08965333     2   .04482667       0.50     0.6083
                       |
              Residual |  5.9073849    66   .08950583
            -----------+---------------------------------------------------
                 Total |  5.9970382    68   .08819174
The twin SS are the same in the two preceding
anovas. The difference in the tests is in the
denominator of the F. The residuals are obviously
different. I (and my profs) prefer the second for testing “main
effects”.
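The difference is just which residual MS goes in the denominator; redoing the arithmetic with the numbers printed in the two tables:

```python
# The twin sequential SS (and hence its MS) is identical in both runs.
ss_twin, df_twin = .08965333, 2
ms_twin = ss_twin / df_twin             # .04482667

# Denominator from the full model (anova y twin twin#woman, seq):
F_full = ms_twin / .09280144            # residual MS on 63 df -> 0.48
# Denominator from the twin-only model (anova y twin):
F_main = ms_twin / .08950583            # residual MS on 66 df -> 0.50

print(round(F_full, 2), round(F_main, 2))  # 0.48 0.5
```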
Clearly, I take a model-building approach to anova and think in terms of the equivalent regression. You can type regress after running anova to see it.