Hi,
I have three questions pertaining to the discussion from three weeks ago
about panel-data models, and in particular, about -regress, cluster(id)-.
David Drukker explained that (1) -xtreg, re- and -xtreg, fe- produce more
efficient estimates than -regress, cluster(id)- and (2) if the covariates
are correlated with the id level error component, u_i, then only -xtreg,
fe- produces consistent estimates.
In light of these conditions, when is -regress, cluster(id)- a recommended
estimation strategy?
I have a model in which a Hausman test fails to reject the random-effects
model. Yet, the coefficient estimates I obtain from -xtreg, re- are
quite different from those that I obtain using -regress, cluster(id)-. Can
anyone suggest why this might occur or what it indicates?
Finally, when I run -regress, cluster(id)-, the results do not include an
F-statistic. What does this indicate?
I include the results from the -xtreg, re- and the -regress, cluster(id)-
models below.
Thanks. Daniel
. xi:reg lnprice herf pgcirc1 markets entry i.year i.group, cluster(mag1) ,
i.year _Iyear_1988-2001 (naturally coded; _Iyear_1988 omitted)
i.group _Igroup_1-53 (naturally coded; _Igroup_1 omitted)
Regression with robust standard errors                 Number of obs =    4174
                                                       F( 51,   541) =       .
                                                       Prob > F      =       .
                                                       R-squared     =  0.1990
Number of clusters (mag1) = 542                        Root MSE      =  .43841
------------------------------------------------------------------------------
             |               Robust
     lnprice |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        herf |    .129881   .1397176     0.93   0.353    -.1445744    .4043365
     pgcirc1 |  -.0000367   .0000131    -2.80   0.005    -.0000625   -.0000109
     markets |  -.0002362   .0035526    -0.07   0.947    -.0072148    .0067425
       entry |   -.007691   .0235849    -0.33   0.744    -.0540203    .0386383
(I cut out the coefficients on a long list of dummies)
       _cons |   .9513429   .0748093    12.72   0.000     .8043905    1.098295
------------------------------------------------------------------------------
. xi:xtreg lnprice herf pgcirc1 markets entry i.year i.group, re i(mag1) , if
> year>=1990 & newchg~=1 & group~=29 & group~=28 & group~=22 & price~=0 &
> cover~=0 & issues>3
i.year _Iyear_1988-2001 (naturally coded; _Iyear_1988 omitted)
i.group _Igroup_1-53 (naturally coded; _Igroup_1 omitted)
Random-effects GLS regression                   Number of obs      =      4174
Group variable (i) : mag1                       Number of groups   =       542
R-sq:  within  = 0.0456                         Obs per group: min =         1
       between = 0.2198                                        avg =       7.7
       overall = 0.1812                                        max =        11
Random effects u_i ~ Gaussian                   Wald chi2(54)      =    318.26
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000
------------------------------------------------------------------------------
     lnprice |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        herf |  -.0030196   .0569924    -0.05   0.958    -.1147227    .1086835
     pgcirc1 |  -.0000211   3.93e-06    -5.37   0.000    -.0000288   -.0000134
     markets |   .0009128   .0010585     0.86   0.389    -.0011619    .0029874
       entry |  -.0252725   .0110294    -2.29   0.022    -.0468898   -.0036552
(I cut out the coefficients on a long list of dummies)
       _cons |   1.127733   .1519215     7.42   0.000     .8299721    1.425493
-------------+----------------------------------------------------------------
     sigma_u |  .44605988
     sigma_e |  .15335763
         rho |  .89429288   (fraction of variance due to u_i)
------------------------------------------------------------------------------
.
end of do-file
At 10:31 AM 6/26/2002 -0500, you wrote:
John Neumann <[email protected]> began an interesting thread on this list
yesterday when he asked whether he should use -reg ,cluster(id)-, -xtreg ,
re- or -xtreg, fe- for estimation and inference about his panel data model.
Then anirban basu <[email protected]> and Mark Schaffer
<[email protected]> both provided interesting responses to John's
original question.
Both Anirban and Mark have pointed out that -regress, cluster(id)- will
provide consistent estimates of the coefficients. There is also agreement
that -xtreg, re- will provide consistent estimates of the coefficients. But
there seems to be some discussion of whether -xtreg, fe- will provide
consistent estimates.
In theory, all three estimators (-regress, cluster(id)-, -xtreg, re-, and
-xtreg, fe-) are consistent estimators of the coefficients for
random-effects data generating processes. To flesh out the details, consider
the random-effects data generating process
y_it = X_it b + u_i + e_it
where X_it is a 1 x K vector of covariates, b is a K x 1 vector of
coefficients, u_i is independently and identically distributed (iid) over
id's, e_it is iid over the observations, and there is no correlation between
X_it and u_i. Under these assumptions, all three estimators also provide
consistent estimators of the VCE matrix, and the resulting Wald tests will
attain nominal coverage, given enough data.
Another theoretical point is in order. For the random-effects data
generating process, -xtreg, re- should provide more efficient estimates of
the coefficients than either of the other two, while -xtreg, fe- should
produce more efficient estimates than -regress, cluster(id)-. (One caveat in
this case is that inference is said to be conditional on the random effects in
the sample.)
To illustrate these points, I have written a small simulation. The do-file
is appended below my signature. Briefly, the program
i) produces 1000 draws from a parameterization of the
random-effects data generating process
ii) runs the three estimators on each sample, saving off the
coefficients
iii) uses -test- to test that the coefficients are equal to their true
values, saving off the p-values
iv) then computes the coverage rates obtained by each estimator on
each test
Let's begin looking at the results for the coefficients.
First, we need to understand the variable names. As can be seen from the
program fevclust.do, appended below,
Variable name Meaning
x1_crg -coefficient on x1 from -regress, cluster(id)-
x2_crg -coefficient on x2 from -regress, cluster(id)-
x1_cfe -coefficient on x1 from -xtreg, fe-
x2_cfe -coefficient on x2 from -xtreg, fe-
x1_cre -coefficient on x1 from -xtreg, re-
x2_cre -coefficient on x2 from -xtreg, re-
Now for these results. The table below presents the summary statistics from
these variables obtained over the 1000 samples that were generated.
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
x1_crg | 1000 2.99791 .1070938 2.540678 3.372749
x2_crg | 1000 3.001715 .1044514 2.71244 3.303503
x1_cfe | 1000 3.000676 .051932 2.847759 3.144626
x2_cfe | 1000 3.003031 .0496754 2.845199 3.160065
x1_cre | 1000 3.000508 .0515863 2.846256 3.150228
x2_cre | 1000 3.00292 .0498531 2.839417 3.163176
There are several points to note. First, the mean of the estimates from each
estimator is very close to the true value of 3.0. Second, the standard
deviation of the estimates from -regress, cluster(id)- is about twice the
standard deviations from the other two estimators. This indicates that
-xtreg, re- and -xtreg, fe- are more efficient than -regress, cluster(id)-.
Third, the standard deviations of the estimates from -xtreg, fe- are
surprisingly close to those of -xtreg, re-. This indicates that for this
parameterization of the data generating process and sample size, -xtreg, fe-
is nearly as efficient an estimator as -xtreg, re-.
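To put a number on "about twice", the standard deviations for x1 in the
table above can be turned into ratios of sampling variances. The lines below
are only a sketch of that arithmetic; the scalar names are made up for this
illustration and are not part of fevclust.do.

```stata
* Standard deviations of the x1 estimates over the 1000 replications,
* copied from the summary table above (scalar names are illustrative only)
scalar sd_crg = .1070938
scalar sd_cfe = .051932
scalar sd_cre = .0515863
* Ratio of the sampling variance of -regress, cluster(id)- to that of
* -xtreg, fe- and -xtreg, re-
scalar relvar_fe = (sd_crg/sd_cfe)^2
scalar relvar_re = (sd_crg/sd_cre)^2
display "var(crg)/var(cfe) = " relvar_fe
display "var(crg)/var(cre) = " relvar_re
```

Both ratios come out a little above 4, so in this particular design the
sampling variance of -regress, cluster(id)- is roughly four times that of
either -xtreg- estimator.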
Now let's consider coverage. The results table below contains the means of
6 binary variables from the 1000 generated samples. In each sample, each
variable is 1 if the null hypothesis in question was rejected at the 5% level
for that sample and zero otherwise. Thus the means in the table below can be
interpreted as empirical rejection rates, which should be close to .05 if the
tests have nominal coverage.
x1_rjrg fraction of tests in which the true null that x1=3 was rejected
after -reg, cluster(id)-
x2_rjrg fraction of tests in which the true null that x2=3 was rejected
after -reg, cluster(id)-
x1_rjfe fraction of tests in which the true null that x1=3 was rejected
after -xtreg, fe-
x2_rjfe fraction of tests in which the true null that x2=3 was rejected
after -xtreg, fe-
x1_rjre fraction of tests in which the true null that x1=3 was rejected
after -xtreg, re-
x2_rjre fraction of tests in which the true null that x2=3 was rejected
after -xtreg, re-
And the results are
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
x1_rjrg | 1000 .05 .218054 0 1
x2_rjrg | 1000 .07 .2552747 0 1
x1_rjfe | 1000 .054 .2261308 0 1
x2_rjfe | 1000 .039 .1936918 0 1
x1_rjre | 1000 .059 .2357426 0 1
x2_rjre | 1000 .042 .2006895 0 1
Note that all the tests are reasonably close to nominal coverage. Also note
that the tests after -xtreg, re- are marginally closer to nominal than those
after -xtreg, fe-.
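How close is "close" with only 1000 replications? One rough yardstick is the
Monte Carlo standard error of a rejection rate whose true value is .05. The
lines below sketch that calculation; the scalar names are invented for this
illustration, and the 1.96 multiplier assumes a normal approximation to the
binomial.

```stata
* Monte Carlo standard error of an estimated rejection rate when the
* true rate is .05 and there are 1000 replications
* (scalar names are illustrative only)
scalar mc_reps    = 1000
scalar mc_nominal = .05
scalar mc_se = sqrt(mc_nominal*(1 - mc_nominal)/mc_reps)
* Approximate 95% band around the nominal rate (normal approximation)
scalar mc_lo = mc_nominal - 1.96*mc_se
scalar mc_hi = mc_nominal + 1.96*mc_se
display "Monte Carlo s.e. = " mc_se
display "lower bound      = " mc_lo
display "upper bound      = " mc_hi
```

The band runs from roughly .036 to .064. Five of the six rates in the table
fall inside it; the .07 for x2_rjrg sits slightly above it, which is worth
keeping in mind before reading much into small differences between the
estimators.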
There is one final point that must be made. The crucial assumption in the
above data generating process is that X_it and u_i are not correlated. If
they are correlated, only -xtreg, fe- will provide consistent estimates.
Below -fevclust.do-, I have appended a second program, called fe_ex.do, that
illustrates this point. -fe_ex.do- generates a single large sample from the
same structure as above, EXCEPT that there is correlation between X_it and
u_i.
Here are the crucial correlations in our sample
. corr x1 x2 ui
(obs=5000)
| x1 x2 ui
-------------+---------------------------
x1 | 1.0000
x2 | 0.6081 1.0000
ui | 0.6349 0.7170 1.0000
Since the true values of all the coefficients are 3.0, the output below
illustrates that -regress, cluster(id)- is not consistent for this data
generating process.
. regress y x1 x2,cluster(id)
Regression with robust standard errors                 Number of obs =    5000
                                                       F(  2,   999) =34624.91
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.9674
Number of clusters (id) = 1000                         Root MSE      =  1.6459
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   3.498473    .024788   141.14   0.000     3.449831    3.547116
          x2 |   3.701595   .0232616   159.13   0.000     3.655948    3.747242
       _cons |    2.97323   .0336344    88.40   0.000     2.907228    3.039232
------------------------------------------------------------------------------
In contrast, the output below illustrates that -xtreg, fe- is a consistent
estimator for the coefficients with this data generating process.
. xtreg y x1 x2, fe i(id)
Fixed-effects (within) regression               Number of obs      =      5000
Group variable (i) : id                         Number of groups   =      1000
R-sq:  within  = 0.9602                         Obs per group: min =         5
       between = 0.9885                                        avg =       5.0
       overall = 0.9672                                        max =         5
                                                F(2,3998)          =  48238.88
corr(u_i, Xb)  = 0.7384                         Prob > F           =    0.0000
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   2.997751   .0164867   181.83   0.000     2.965428    3.030074
          x2 |   2.993408   .0157054   190.60   0.000     2.962616    3.024199
       _cons |   2.942511   .0140139   209.97   0.000     2.915036    2.969986
-------------+----------------------------------------------------------------
     sigma_u |  2.0660572
     sigma_e |  .99033698
         rho |  .81316439   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(999, 3998) =     9.81           Prob > F = 0.0000
Finally, the output below illustrates that -xtreg, re- is not a consistent
estimator for the coefficients with this data generating process.
. xtreg y x1 x2, re i(id)
Random-effects GLS regression                   Number of obs      =      5000
Group variable (i) : id                         Number of groups   =      1000
R-sq:  within  = 0.9601                         Obs per group: min =         5
       between = 0.9885                                        avg =       5.0
       overall = 0.9674                                        max =         5
Random effects u_i ~ Gaussian                   Wald chi2(2)       = 109595.77
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   3.259649   .0197243   165.26   0.000      3.22099    3.298308
          x2 |   3.363142   .0179413   187.45   0.000     3.327977    3.398306
       _cons |   2.958558   .0344846    85.79   0.000      2.89097    3.026147
-------------+----------------------------------------------------------------
     sigma_u |  .72744824
     sigma_e |  .99033698
         rho |  .35046296   (fraction of variance due to u_i)
------------------------------------------------------------------------------
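The variance components in the two -xtreg- outputs can be compared with the
values built into fe_ex.do, where ui is 2 times a standard normal draw and
eit is a standard normal draw. The lines below are a sketch of that check;
the scalar names are invented for this illustration.

```stata
* True variance components implied by fe_ex.do:
*   ui  = 2*invnorm(uniform())  -> standard deviation 2
*   eit =   invnorm(uniform())  -> standard deviation 1
* (scalar names are illustrative only)
scalar true_sigma_u = 2
scalar true_sigma_e = 1
* Fraction of the error variance due to u_i
scalar true_rho = true_sigma_u^2/(true_sigma_u^2 + true_sigma_e^2)
display "true rho = " true_rho
```

The true rho is .8. The -xtreg, fe- output reports sigma_u of about 2.07 and
rho of about .81, close to the truth, while the -xtreg, re- output reports
sigma_u of about .73 and rho of about .35. That gap is presumably another
symptom of the same problem: with X_it and u_i correlated, part of u_i ends
up attributed to x1 and x2.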
There are many other points that could be made about the results presented
above. However, I hope that these simulations help to illustrate the basic
points that
i) -regress, cluster(id)-, -xtreg, re-, and -xtreg, fe- are all
consistent estimators for the coefficients with the random-effects
data generating process.
ii) Tests performed on the coefficients after -regress,
cluster(id)-, -xtreg, re-, and -xtreg, fe- will have close to
nominal coverage when the data were generated by a random-effects
data generating process.
iii) -xtreg, re- and -xtreg, fe- produce more efficient estimates
than -regress, cluster(id)-.
iv) If the covariates are correlated with the id level error
component, u_i, then only -xtreg, fe- produces consistent estimates.
--David
[email protected]
----------------- begin fevclust.do---------------------------------------
clear
capture log close
log using fevclust.log , replace
set seed 1234567
postfile redat x1_crg x2_crg x1_prg x2_prg x1_cfe x2_cfe x1_pfe x2_pfe /*
*/ x1_cre x2_cre x1_pre x2_pre using redat, replace double
forvalues i=1/1000 {
qui {
drop _all
set obs 100
gen ui=2*invnorm(uniform())
gen id =_n
expand 5
sort id
gen x1=invnorm(uniform())
gen x2=invnorm(uniform())+.3*x1
gen eit=invnorm(uniform())
gen y=3+3*x1+3*x2+ui+eit
regress y x1 x2,cluster(id)
scalar x1_crg = _b[x1]
scalar x2_crg = _b[x2]
test x1 = 3
scalar x1_prg = r(p)
test x2 = 3
scalar x2_prg = r(p)
xtreg y x1 x2, fe i(id)
scalar x1_cfe = _b[x1]
scalar x2_cfe = _b[x2]
test x1 = 3
scalar x1_pfe = r(p)
test x2 = 3
scalar x2_pfe = r(p)
xtreg y x1 x2, re i(id)
scalar x1_cre = _b[x1]
scalar x2_cre = _b[x2]
test x1 = 3
scalar x1_pre = r(p)
test x2 = 3
scalar x2_pre = r(p)
post redat (x1_crg) (x2_crg) (x1_prg) (x2_prg) (x1_cfe) /*
*/ (x2_cfe) (x1_pfe) (x2_pfe) (x1_cre) (x2_cre) /*
*/ (x1_pre) (x2_pre)
}
}
postclose redat
use redat, clear
gen x1_rjrg=(x1_prg<.05)
gen x2_rjrg=(x2_prg<.05)
gen x1_rjfe=(x1_pfe<.05)
gen x2_rjfe=(x2_pfe<.05)
gen x1_rjre=(x1_pre<.05)
gen x2_rjre=(x2_pre<.05)
sum
save redat, replace
capture log close
----------------- end fevclust.do---------------------------------------
----------------- begin fe_ex.do---------------------------------------
clear
set seed 1234567
drop _all
set obs 1000
gen ui=2*invnorm(uniform())
gen id =_n
expand 5
sort id
gen x1=invnorm(uniform())+.4*ui
gen x2=invnorm(uniform())+.3*x1 + .4*ui
gen eit=invnorm(uniform())
gen y=3+3*x1+3*x2+ui+eit
corr x1 x2 ui
regress y x1 x2,cluster(id)
xtreg y x1 x2, fe i(id)
xtreg y x1 x2, re i(id)
----------------- end fe_ex.do---------------------------------------
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
Daniel Simon
Assistant Professor
Department of Applied Economics and Management
Cornell University
(607) 255-1626