John Neumann <[email protected]> began an interesting thread on this list
yesterday when he asked whether he should use -regress, cluster(id)-,
-xtreg, re-, or -xtreg, fe- for estimation and inference about his panel-data
model. Then Anirban Basu <[email protected]> and Mark Schaffer
<[email protected]> both provided interesting responses to John's
original question.
Both Anirban and Mark have pointed out that -regress, cluster(id)- will
provide consistent estimates of the coefficients. There is also agreement
that -xtreg, re- will provide consistent estimates of the coefficients. But
there seems to be some discussion of whether -xtreg, fe- will provide
consistent estimates.
In theory, all three estimators (-regress, cluster(id)-, -xtreg, re-, and
-xtreg, fe-) are consistent estimators of the coefficients for
random-effects data generating processes. To flesh out the details, consider
the random-effects data generating process

        y_it = X_it b + u_i + e_it

where X_it is a 1 x K vector of covariates, b is a K x 1 vector of
coefficients, u_i is independently and identically distributed (iid) over
id's, e_it is iid over the observations, and there is no correlation between
X_it and u_i. Under these assumptions, all three estimators also provide
consistent estimators of the VCE matrix, so the resulting Wald tests will
attain their nominal coverage, given enough data.
Another theoretical point is in order. For the random-effects data
generating process, -xtreg, re- should provide more efficient estimates of
the coefficients than either of the other two, while -xtreg, fe- should
produce more efficient estimates than -regress, cluster(id)-. (One caveat
for -xtreg, fe- is that inference is said to be conditional on the random
effects in the sample.)
To illustrate these points, I have written a small simulation. The do-file
is appended below my signature. Briefly, the program
        i)   produces 1000 draws from a parameterization of the
             random-effects data generating process
        ii)  runs the three estimators on each sample, saving the
             coefficients
        iii) uses -test- to test that the coefficients are equal to their
             true values, saving the p-values
        iv)  then computes the rejection rate obtained by each estimator on
             each test
Let's begin looking at the results for the coefficients.
First, we need to understand the variable names. As can be seen from the
program fevclust.do, appended below,
        Variable name   Meaning
        x1_crg          coefficient on x1 from -regress, cluster(id)-
        x2_crg          coefficient on x2 from -regress, cluster(id)-
        x1_cfe          coefficient on x1 from -xtreg, fe-
        x2_cfe          coefficient on x2 from -xtreg, fe-
        x1_cre          coefficient on x1 from -xtreg, re-
        x2_cre          coefficient on x2 from -xtreg, re-
Now for these results. The table below presents the summary statistics from
these variables obtained over the 1000 samples that were generated.
    Variable |       Obs        Mean   Std. Dev.        Min        Max
-------------+--------------------------------------------------------
      x1_crg |      1000     2.99791    .1070938   2.540678   3.372749
      x2_crg |      1000    3.001715    .1044514    2.71244   3.303503
      x1_cfe |      1000    3.000676     .051932   2.847759   3.144626
      x2_cfe |      1000    3.003031    .0496754   2.845199   3.160065
      x1_cre |      1000    3.000508    .0515863   2.846256   3.150228
      x2_cre |      1000     3.00292    .0498531   2.839417   3.163176
There are several points to note. First, the mean of the estimates from each
estimator is very close to the true value of 3.0. Second, the standard
deviation of the estimates from -regress, cluster(id)- is about twice the
standard deviations from the other two estimators. This indicates that
-xtreg, re- and -xtreg, fe- are more efficient than -regress, cluster(id)-.
Third, the standard deviations of the estimates from -xtreg, fe- are
surprisingly close to those from -xtreg, re-. This indicates that for this
parameterization of the data generating process and sample size, -xtreg, fe-
is essentially as efficient an estimator as -xtreg, re-.
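One way to see why the two -xtreg- estimators land so close together: the
random-effects GLS estimator is OLS on quasi-demeaned data, y_it - theta*ybar_i,
with theta = 1 - sqrt(sigma_e^2/(T*sigma_u^2 + sigma_e^2)). Setting theta = 1
gives the within (fixed-effects) estimator and theta = 0 gives pooled OLS. The
sketch below plugs in the true values from fevclust.do (sigma_u = 2,
sigma_e = 1, T = 5) rather than the estimated ones that -xtreg, re- would
use; the scalar names are only illustrative.

```stata
* quasi-demeaning weight of the random-effects estimator, evaluated at the
* TRUE variance components of the simulation (an assumption made for
* illustration; -xtreg, re- estimates them from each sample)
scalar sigma_u = 2
scalar sigma_e = 1
scalar nper    = 5
scalar theta   = 1 - sqrt(sigma_e^2/(nper*sigma_u^2 + sigma_e^2))
display "theta = " %5.3f theta
```

With these values theta is about 0.78, so -xtreg, re- sits much closer to the
within estimator than to pooled OLS, which is in line with the standard
deviations in the table above.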
Now let's consider coverage. The results table below contains the means of
6 binary variables from the 1000 generated samples. In each sample, each
variable is 1 if the null hypothesis in question was rejected at the 5% level
for that sample and zero otherwise. Thus the means in the table below can be
interpreted as empirical rejection rates; a rate close to .05 corresponds to
nominal coverage.
        x1_rjrg   fraction of tests in which the true null that the
                  coefficient on x1 is 3 was rejected after
                  -regress, cluster(id)-
        x2_rjrg   fraction of tests in which the true null that the
                  coefficient on x2 is 3 was rejected after
                  -regress, cluster(id)-
        x1_rjfe   fraction of tests in which the true null that the
                  coefficient on x1 is 3 was rejected after -xtreg, fe-
        x2_rjfe   fraction of tests in which the true null that the
                  coefficient on x2 is 3 was rejected after -xtreg, fe-
        x1_rjre   fraction of tests in which the true null that the
                  coefficient on x1 is 3 was rejected after -xtreg, re-
        x2_rjre   fraction of tests in which the true null that the
                  coefficient on x2 is 3 was rejected after -xtreg, re-
And the results are
    Variable |       Obs        Mean   Std. Dev.        Min        Max
-------------+--------------------------------------------------------
     x1_rjrg |      1000         .05     .218054          0          1
     x2_rjrg |      1000         .07    .2552747          0          1
     x1_rjfe |      1000        .054    .2261308          0          1
     x2_rjfe |      1000        .039    .1936918          0          1
     x1_rjre |      1000        .059    .2357426          0          1
     x2_rjre |      1000        .042    .2006895          0          1
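Note that all the tests are reasonably close to the nominal rejection rate
of .05. The largest departure is the .07 for the test on x2 after
-regress, cluster(id)-. The rates after -xtreg, re- and -xtreg, fe- are very
similar, and neither is uniformly closer to nominal than the other.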
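To judge how close is "close", it helps to know how much a rejection rate
based on 1000 replications moves around by chance. A rough sketch, assuming
the true size of the test is exactly 5% and using a normal approximation (the
scalar name mcse is only illustrative):

```stata
* Monte Carlo standard error of a rejection rate when the true size is .05
* and 1000 replications are used, plus an approximate 95% band around .05
scalar mcse = sqrt(.05*.95/1000)
display "MC std. err. = " %6.4f mcse
display "approx. band = [" %6.4f .05-1.96*mcse ", " %6.4f .05+1.96*mcse "]"
```

This gives a standard error of roughly .007 and a band of roughly .036 to
.064, so rates such as .039 or .059 are within simulation noise, while .07
lies somewhat outside the band.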
There is one final point that must be made. The crucial assumption in the
above data generating process is that X_it and u_i are not correlated. If
they are correlated, only -xtreg, fe- will provide consistent estimates.
Below fevclust.do, I have appended a second program, called fe_ex.do, that
illustrates this point. fe_ex.do generates a single large sample from the
same structure as above, EXCEPT that there is correlation between X_it and
u_i.
Here are the crucial correlations in our sample:
. corr x1 x2 ui
(obs=5000)

             |       x1       x2       ui
-------------+---------------------------
          x1 |   1.0000
          x2 |   0.6081   1.0000
          ui |   0.6349   0.7170   1.0000
Since the true values of all the coefficients are 3.0, the output below
illustrates that -regress, cluster(id)- is not consistent for this data
generating process.
. regress y x1 x2,cluster(id)

Regression with robust standard errors                 Number of obs =    5000
                                                       F(  2,   999) =34624.91
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.9674
Number of clusters (id) = 1000                         Root MSE      =  1.6459

------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   3.498473    .024788   141.14   0.000     3.449831    3.547116
          x2 |   3.701595   .0232616   159.13   0.000     3.655948    3.747242
       _cons |    2.97323   .0336344    88.40   0.000     2.907228    3.039232
------------------------------------------------------------------------------
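The size of this inconsistency can be checked by hand. For pooled OLS the
slope estimates converge to b + inv(Var(X))*Cov(X,u_i). The sketch below
uses population moments that I worked out from the generating statements in
fe_ex.do (ui has variance 4, x1 = z1 + .4*ui, x2 = z2 + .3*x1 + .4*ui with
z1 and z2 standard normal), so treat the numbers typed into the matrices as
an assumption; the matrix names are only illustrative.

```stata
* population moments implied by fe_ex.do (derived by hand, an assumption):
*   Var(x1) = 1.64, Cov(x1,x2) = 1.132, Var(x2) = 2.1716
*   Cov(x1,ui) = 1.6, Cov(x2,ui) = 2.08
matrix Sxx = (1.64, 1.132 \ 1.132, 2.1716)
matrix Sxu = (1.6 \ 2.08)
* asymptotic bias of the pooled OLS slopes: inv(Var(X)) * Cov(X,ui)
matrix bias = invsym(Sxx)*Sxu
matrix list bias
```

The implied biases are about .49 for x1 and .70 for x2, that is, limits of
about 3.49 and 3.70, which agree closely with the estimates of 3.498 and
3.702 in the output above.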
In contrast, the output below illustrates that -xtreg, fe- is a consistent
estimator for the coefficients with this data generating process.
. xtreg y x1 x2, fe i(id)

Fixed-effects (within) regression               Number of obs      =      5000
Group variable (i) : id                         Number of groups   =      1000

R-sq:  within  = 0.9602                         Obs per group: min =         5
       between = 0.9885                                        avg =       5.0
       overall = 0.9672                                        max =         5

                                                F(2,3998)          =  48238.88
corr(u_i, Xb)  = 0.7384                         Prob > F           =    0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   2.997751   .0164867   181.83   0.000     2.965428    3.030074
          x2 |   2.993408   .0157054   190.60   0.000     2.962616    3.024199
       _cons |   2.942511   .0140139   209.97   0.000     2.915036    2.969986
-------------+----------------------------------------------------------------
     sigma_u |  2.0660572
     sigma_e |  .99033698
         rho |  .81316439   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(999, 3998) =     9.81           Prob > F = 0.0000
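The reason -xtreg, fe- survives the correlation is the within transformation:
subtracting id-level means from the model gives
y_it - ybar_i = (X_it - Xbar_i) b + (e_it - ebar_i), and u_i has dropped out.
The sketch below does the demeaning by hand on a sample generated with the
same statements as fe_ex.do; the m_ and d_ variable names are only
illustrative, and on a different Stata version the random draws may not
reproduce the numbers shown in this post.

```stata
* generate a sample with the same statements as fe_ex.do
clear
set seed 1234567
set obs 1000
gen ui = 2*invnorm(uniform())
gen id = _n
expand 5
sort id
gen x1 = invnorm(uniform()) + .4*ui
gen x2 = invnorm(uniform()) + .3*x1 + .4*ui
gen eit = invnorm(uniform())
gen y = 3 + 3*x1 + 3*x2 + ui + eit

* within transformation: subtract the id-level mean from each variable,
* which removes ui (it is constant within id)
foreach v of varlist y x1 x2 {
    egen double m_`v' = mean(`v'), by(id)
    gen double d_`v' = `v' - m_`v'
}

* OLS on the demeaned data reproduces the -xtreg, fe- slopes; its standard
* errors do not adjust the degrees of freedom for the 1000 estimated means
regress d_y d_x1 d_x2, noconstant
xtreg y x1 x2, fe i(id)
```

The two sets of slope coefficients should agree up to rounding; only the
standard errors differ.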
Finally, the output below illustrates that -xtreg, re- is not a consistent
estimator for the coefficients with this data generating process.
. xtreg y x1 x2, re i(id)

Random-effects GLS regression                   Number of obs      =      5000
Group variable (i) : id                         Number of groups   =      1000

R-sq:  within  = 0.9601                         Obs per group: min =         5
       between = 0.9885                                        avg =       5.0
       overall = 0.9674                                        max =         5

Random effects u_i ~ Gaussian                   Wald chi2(2)       = 109595.77
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   3.259649   .0197243   165.26   0.000      3.22099    3.298308
          x2 |   3.363142   .0179413   187.45   0.000     3.327977    3.398306
       _cons |   2.958558   .0344846    85.79   0.000      2.89097    3.026147
-------------+----------------------------------------------------------------
     sigma_u |  .72744824
     sigma_e |  .99033698
         rho |  .35046296   (fraction of variance due to u_i)
------------------------------------------------------------------------------
There are many other points that could be made about the results presented
above. However, I hope that these simulations help to illustrate the basic
points that
        i)   -regress, cluster(id)-, -xtreg, re-, and -xtreg, fe- are all
             consistent estimators for the coefficients with the
             random-effects data generating process.
        ii)  Tests performed on the coefficients after
             -regress, cluster(id)-, -xtreg, re-, and -xtreg, fe- will have
             close to nominal coverage when the data were generated by a
             random-effects data generating process.
        iii) -xtreg, re- and -xtreg, fe- produce more efficient estimates
             than -regress, cluster(id)-.
        iv)  If the covariates are correlated with the id-level error
             component, u_i, then only -xtreg, fe- produces consistent
             estimates.
--David
[email protected]
----------------- begin fevclust.do---------------------------------------
clear
capture log close
log using fevclust.log , replace
set seed 1234567

* results file: coefficients (_c..) and p-values (_p..) for each estimator
* (rg = regress, cluster(id); fe = xtreg, fe; re = xtreg, re)
postfile redat x1_crg x2_crg x1_prg x2_prg x1_cfe x2_cfe x1_pfe x2_pfe /*
        */ x1_cre x2_cre x1_pre x2_pre using redat, replace double

forvalues i=1/1000 {
    qui {
        * draw one sample: 100 ids with 5 observations each
        drop _all
        set obs 100
        gen ui=2*invnorm(uniform())
        gen id =_n
        expand 5
        sort id
        gen x1=invnorm(uniform())
        gen x2=invnorm(uniform())+.3*x1
        gen eit=invnorm(uniform())
        gen y=3+3*x1+3*x2+ui+eit

        * pooled OLS with cluster-robust VCE
        regress y x1 x2,cluster(id)
        scalar x1_crg = _b[x1]
        scalar x2_crg = _b[x2]
        test x1 = 3
        scalar x1_prg = r(p)
        test x2 = 3
        scalar x2_prg = r(p)

        * fixed-effects (within) estimator
        xtreg y x1 x2, fe i(id)
        scalar x1_cfe = _b[x1]
        scalar x2_cfe = _b[x2]
        test x1 = 3
        scalar x1_pfe = r(p)
        test x2 = 3
        scalar x2_pfe = r(p)

        * random-effects GLS estimator
        xtreg y x1 x2, re i(id)
        scalar x1_cre = _b[x1]
        scalar x2_cre = _b[x2]
        test x1 = 3
        scalar x1_pre = r(p)
        test x2 = 3
        scalar x2_pre = r(p)

        post redat (x1_crg) (x2_crg) (x1_prg) (x2_prg) (x1_cfe) /*
                */ (x2_cfe) (x1_pfe) (x2_pfe) (x1_cre) (x2_cre) /*
                */ (x1_pre) (x2_pre)
    }
}
postclose redat

* rejection indicators: 1 if the true null was rejected at the 5% level
use redat, clear
gen x1_rjrg=(x1_prg<.05)
gen x2_rjrg=(x2_prg<.05)
gen x1_rjfe=(x1_pfe<.05)
gen x2_rjfe=(x2_pfe<.05)
gen x1_rjre=(x1_pre<.05)
gen x2_rjre=(x2_pre<.05)
sum
save redat, replace
capture log close
----------------- end fevclust.do---------------------------------------
----------------- begin fe_ex.do---------------------------------------
clear
set seed 1234567
drop _all

* one large sample: 1000 ids with 5 observations each
set obs 1000
gen ui=2*invnorm(uniform())
gen id =_n
expand 5
sort id

* unlike fevclust.do, both covariates are correlated with ui
gen x1=invnorm(uniform())+.4*ui
gen x2=invnorm(uniform())+.3*x1 + .4*ui
gen eit=invnorm(uniform())
gen y=3+3*x1+3*x2+ui+eit

corr x1 x2 ui
regress y x1 x2,cluster(id)
xtreg y x1 x2, fe i(id)
xtreg y x1 x2, re i(id)
----------------- end fe_ex.do---------------------------------------