Re: st: Regression with about 5000 (dummy) variables
From: John Antonakis <[email protected]>
To: [email protected]
Subject: Re: st: Regression with about 5000 (dummy) variables
Date: Thu, 19 Apr 2012 22:30:28 +0200
Hi:
Suppose the fixed effects are idcode and south.
clear
webuse nlswork
xtset idcode
bys idcode : egen double cl_age_id = mean(age)
bys south : egen double cl_age_south = mean(age)
reg ln_w age i.south i.idcode, cluster(idcode)
This gives:
Linear regression                                      Number of obs =   28502
                                                       F(  1,  4709) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.6643
                                                       Root MSE      =  .30322

                               (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0006658    27.32   0.000     .0168872    .0194977
     1.south |  -.0774963   .0195974    -3.95   0.000    -.1159164   -.0390761
             |
      idcode |
          2  |  -.3705713   .0006658  -556.59   0.000    -.3718765    -.369266
[snip]
       5159  |  -.3570145   .0207303   -17.22   0.000    -.3976556   -.3163734
             |
       _cons |   1.561366   .0175324    89.06   0.000     1.526995    1.595738
------------------------------------------------------------------------------
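Notice that, with a cluster-robust VCE, we have run out of DF and the
overall F-test cannot be computed: the clustered VCE can have rank of at
most 4709 (the number of clusters minus one), which is fewer than the
4711 coefficients being tested. Had we not used a cluster-robust VCE, we
would have had 4711 degrees of freedom in the numerator of the F-test: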
reg ln_w age i.south i.idcode
      Source |       SS       df       MS             Number of obs  =   28502
-------------+------------------------------          F(4711, 23790) =    9.99
       Model |  4328.36582  4711  .918778566          Prob > F       =  0.0000
    Residual |   2187.3642 23790  .091944691          R-squared      =  0.6643
-------------+------------------------------          Adj R-squared  =  0.5978
       Total |  6515.73002 28501  .228614085          Root MSE       =  .30322

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0003475    52.35   0.000     .0175113    .0188736
     1.south |  -.0774963   .0112551    -6.89   0.000     -.099557   -.0554355
             |
      idcode |
          2  |  -.3705713   .1237911    -2.99   0.003    -.6132097   -.1279328
[snip]
       5159  |  -.3570145   .1446902    -2.47   0.014    -.6406165   -.0734125
             |
       _cons |   1.561366   .0880103    17.74   0.000     1.388861    1.733872
------------------------------------------------------------------------------
When we use xtreg, we get:
iis idcode
xtreg ln_w age i.south, fe cluster(idcode)
Fixed-effects (within) regression               Number of obs      =     28502
Group variable: idcode                          Number of groups   =      4710

R-sq:  within  = 0.1044                         Obs per group: min =         1
       between = 0.1233                                        avg =       6.1
       overall = 0.1062                                        max =        15

                                                F(2,4709)          =    455.05
corr(u_i, Xb)  = 0.0818                         Prob > F           =    0.0000

                               (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0181924   .0006083    29.91   0.000     .0169999     .019385
     1.south |  -.0774963   .0179053    -4.33   0.000     -.112599   -.0423935
       _cons |   1.178256   .0190444    61.87   0.000      1.14092    1.215592
-------------+----------------------------------------------------------------
     sigma_u |  .39998991
     sigma_e |  .30322383
         rho |  .63504833   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Notice that there is no F-test for the fixed effects (the "F test that
all u_i=0" usually printed at the bottom of the regression table).
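To see that F-test, one option is to fit the same fixed-effects model
with the conventional (non-clustered) VCE. This is only a sketch; it
assumes nlswork is still in memory and that idcode is the panel
variable, as set up above.

```stata
* Sketch: same within regression, but without the cluster-robust VCE.
* Assumes webuse nlswork and xtset idcode have already been run.
xtreg ln_w age i.south, fe

* With the conventional VCE, the footer of the output includes the
* line "F test that all u_i=0", which is absent in the clustered run.
```

The point estimates are unchanged by the choice of VCE; only the
standard errors and the reported tests differ.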
Now, let's run it à la Mundlak:
. xtreg ln_w age cl*, cluster(idcode)
Random-effects GLS regression                   Number of obs      =     28510
Group variable: idcode                          Number of groups   =      4710

R-sq:  within  = 0.1032                         Obs per group: min =         1
       between = 0.1271                                        avg =       6.1
       overall = 0.1133                                        max =        15

                                                Wald chi2(3)       =   1397.57
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

                               (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0182259   .0006078    29.99   0.000     .0170347    .0194171
   cl_age_id |   .0052512   .0012583     4.17   0.000      .002785    .0077174
cl_age_south |  -.2571161   .0234542   -10.96   0.000    -.3030855   -.2111467
       _cons |   8.445617   .6805643    12.41   0.000     7.111736    9.779499
-------------+----------------------------------------------------------------
     sigma_u |  .35875483
     sigma_e |  .30323734
         rho |  .58327855   (fraction of variance due to u_i)
------------------------------------------------------------------------------
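The estimate for age (.0182259) matches the fixed-effects estimate
(.0181924) to three decimal places; it is a wee bit off, probably due to
the unbalanced panel. Note also that this run uses 28510 observations
rather than the 28502 used above.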
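One way to check how close the two estimates are is to build the cluster
means on a common sample and tabulate the age coefficient from both
models side by side. The following is only a sketch: touse, m_age_id,
m_age_south, and the stored-estimate names fe and mundlak are
hypothetical names introduced here, and restricting to observations
with nonmissing ln_wage, age, and south is my assumption about what a
"common sample" should be.

```stata
* Sketch, assuming webuse nlswork and xtset idcode have been run.
* All new variable and estimate names below are hypothetical.
gen byte touse = !missing(ln_wage, age, south)

* Cluster means computed only over the common sample
bys idcode: egen double m_age_id    = mean(age) if touse
bys south:  egen double m_age_south = mean(age) if touse

* Fixed-effects (within) estimator
xtreg ln_w age i.south if touse, fe vce(cluster idcode)
estimates store fe

* Random-effects estimator with the cluster means added (Mundlak)
xtreg ln_w age m_age_id m_age_south if touse, re vce(cluster idcode)
estimates store mundlak

* Compare the age coefficient and its standard error across models
estimates table fe mundlak, keep(age) b(%9.7f) se(%9.7f)
```

With both models fit on the same observations, any remaining gap in the
age coefficient cannot be attributed to a difference in samples.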
With OLS à la Mundlak we have:
reg ln_w age cl*, cluster(idcode)
Linear regression                                      Number of obs =   28510
                                                       F(  3,  4709) =  493.26
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1182
                                                       Root MSE      =  .44897

                               (Std. Err. adjusted for 4710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0182808   .0006088    30.03   0.000     .0170872    .0194743
   cl_age_id |   .0050963   .0013429     3.80   0.000     .0024637     .007729
cl_age_south |  -.4121572   .0236396   -17.44   0.000    -.4585019   -.3658125
       _cons |    12.9671   .6872252    18.87   0.000     11.61982    14.31439
------------------------------------------------------------------------------
The estimate still looks good. Notice, though, that the F-test
numerator DF is only 3. That is what I meant when I said we save on
DF (as compared to the OLS fixed-effects estimator).
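The earlier message quoted below describes testing the cluster means
jointly as a Hausman-type test of fixed versus random effects. Applied
to this example, a minimal sketch would be the following; it assumes
cl_age_id and cl_age_south were created as shown at the top of this
message.

```stata
* Sketch: Mundlak regression followed by a joint test of the cluster means.
* Assumes cl_age_id and cl_age_south already exist and idcode is xtset.
xtreg ln_w age cl_age_id cl_age_south, re vce(cluster idcode)

* Joint Wald test that the coefficients on the cluster means are zero.
* Rejection suggests the random-effects assumption does not hold.
testparm cl_age_id cl_age_south
```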
Best,
J.
__________________________________________
Prof. John Antonakis
Faculty of Business and Economics
Department of Organizational Behavior
University of Lausanne
Internef #618
CH-1015 Lausanne-Dorigny
Switzerland
Tel ++41 (0)21 692-3438
Fax ++41 (0)21 692-3305
http://www.hec.unil.ch/people/jantonakis
Associate Editor
The Leadership Quarterly
__________________________________________
On 19.04.2012 17:16, Austin Nichols wrote:
> John Antonakis <[email protected]>:
> The poster asked about multiple dimensions of fixed effects--how does
> the advice below relate?
> The approach shown actually adds to the size of the matrix to be inverted.
> You assert that
> "This will save you on degrees of freedom and computational requirements."
> --can you clarify that claim?
> Your
> xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)
> is nearly the same as
> xtreg y x1-x4, fe robust
> right? Note that inference is not identical, as the RE estimator
> does not "know" the means are estimated.
>
> On Thu, Apr 19, 2012 at 10:57 AM, John Antonakis <[email protected]> wrote:
>> Hi:
>>
>> Let me let you in on a trick that is relatively unknown.
>>
>> One way around the problem of a huge amount of dummy variables is to use
>> the Mundlak procedure:
>>
>> Mundlak, Y. (1978). Pooling of Time-Series and Cross-Section Data.
>> Econometrica, 46(1), 69-85.
>>
>> ....for an intuitive explanation, see:
>>
>> Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making
>> causal claims: A review and recommendations. The Leadership Quarterly,
>> 21(6), 1086-1120. http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf
>>
>> Basically, for each time varying independent variable (x1-x4), take the
>> cluster mean and include that in the regression. That is, do:
>>
>> foreach var of varlist x1-x4 {
>> bys panelvar: egen cl_`var'=mean(`var')
>> }
>>
>> Then, run your regression like this:
>>
>> xtreg y x1-x4 cl_x1-cl_x4, cluster(panelvar)
>>
>> The Hausman test for fixed- versus random-effects is:
>>
>> testparm cl_x1-cl_x4
>>
>> This will save you on degrees of freedom and computational requirements.
>> This estimator is consistent. Try it out with a subsample of your dataset
>> to see. Many econometricians have been amazed by this.