Dear Statalist colleagues:
I've encountered (via David Stromberg) a peculiar feature of
regression with heteroskedasticity-robust SEs when using dummy
variables.
If a dummy variable takes the value 1 for a single observation
and 0 for all the others, some strange things happen:
1. The robust SEs still look quite plausible.
2. The F-stat is reported as missing. There is a hyperlink for the
missing F-stat in the regression output (Stata v7) but it doesn't
mention the singleton dummy as a possible explanation.
3. The robust var-cov matrix is not of full rank. Invert it with
syminv() and one of the rows/columns becomes all zeros (but not
necessarily the one corresponding to the singleton dummy).
Does anybody have any ideas on how to interpret this? Are the robust
SEs usable anyway? Is the robust var-cov matrix still usable?
I should note that singleton dummies are not so unusual. For
example, one longstanding recommendation for dealing with an outlier
is to create a dummy for it. It would seem that this recommendation
isn't compatible with using robust SEs at the same time.
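For comparison, a singleton dummy fits its own observation exactly, so the
coefficients on the other regressors should match those obtained by simply
dropping that observation. A minimal sketch of that comparison, assuming a
Stata version that has -sysuse- (otherwise -use- a local copy of auto.dta):

```stata
* Sketch only: compare "dummy out the outlier" with "drop the outlier"
sysuse auto, clear
generate byte singledummy = (_n == 1)

* Outlier handled with its own dummy
regress weight length singledummy, robust

* Same outlier handled by excluding the observation
regress weight length if singledummy == 0, robust
```

The coefficient on length should agree across the two fits. Comparing the
reported F-stat and robust SEs between them is one way to see what the
dummy itself changes.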
A demonstration with the infamous auto.dta follows.
--Mark
. use d:\stata\auto, replace
(1978 Automobile Data)
.
. gen singledummy=0
. replace singledummy=1 if _n==1
(1 real change made)
<Standard regression, no robust, nothing unusual>
.
. regress weight length singledummy
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =  302.43
       Model |  39461973.9     2  19730986.9           Prob > F      =  0.0000
    Residual |  4632204.50    71   65242.317           R-squared     =  0.8949
-------------+------------------------------           Adj R-squared =  0.8920
       Total |  44094178.4    73  604029.841           Root MSE      =  255.43
------------------------------------------------------------------------------
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      length |   33.01849   1.342694    24.59   0.000     30.34123    35.69575
 singledummy |  -26.00488   257.1827    -0.10   0.920    -538.8127    486.8029
       _cons |  -3185.434   254.1358   -12.53   0.000    -3692.167   -2678.702
------------------------------------------------------------------------------
<Var-cov matrix is full rank>
. mat Vinv=syminv(e(V))
. mat list Vinv
symmetric Vinv[3,3]
                  length  singledummy        _cons
     length    40.614269
singledummy    .00285091    .00001533
      _cons     .2131592    .00001533    .00113423
<Same regression with robust, and strange things happen>
. regress weight length singledummy, robust
<SEs look similar to non-robust above, but F-stat is missing>
Regression with robust standard errors                 Number of obs =      74
                                                       F(  1,    71) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.8949
                                                       Root MSE      =  255.43
------------------------------------------------------------------------------
             |               Robust
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      length |   33.01849   1.279353    25.81   0.000     30.46753    35.56945
 singledummy |  -26.00488   30.18275    -0.86   0.392    -86.18757    34.17782
       _cons |  -3185.434   242.0935   -13.16   0.000    -3668.155   -2702.713
------------------------------------------------------------------------------
<Var-cov matrix isn't full rank>
. mat Vinv=syminv(e(V))
. mat list Vinv
symmetric Vinv[3,3]
                  length  singledummy        _cons
     length            0
singledummy            0    .00114255
      _cons            0    .00002822    .00001776
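One further check that may help narrow this down is to look at the residual
for the singleton observation and to count the diagonal elements that
syminv() sets to zero. This is a sketch, not a diagnosis: it assumes a Stata
version with -sysuse-, and the names ehat, V and Vinv are illustrative.

```stata
* Sketch only: inspect the singleton residual and the rank of the robust VCE
sysuse auto, clear
generate byte singledummy = (_n == 1)
regress weight length singledummy, robust

* Residual for the observation that has its own dummy
predict double ehat, residuals
list make ehat in 1

* syminv() zeroes the row/column of any linearly dependent column;
* diag0cnt() counts the zeros left on the diagonal
matrix V = e(V)
matrix Vinv = syminv(V)
display "rank of e(V) = " rowsof(V) - diag0cnt(Vinv)
```

If the residual in observation 1 comes out as zero (as it should, since the
dummy fits that observation exactly), then that observation adds nothing to
the sum of squared-residual-weighted cross products behind the robust var-cov
matrix, which may be where the rank deficiency comes from.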
Prof. Mark E. Schaffer
Director
Centre for Economic Reform and Transformation
Department of Economics
School of Management & Languages
Heriot-Watt University, Edinburgh EH14 4AS UK
44-131-451-3494 direct
44-131-451-3008 fax
44-131-451-3485 CERT administrator
http://www.som.hw.ac.uk/cert