Title | Between estimators |
Author | William Gould, StataCorp |
I understand the basic differences between a fixed-effects and a random-effects model for a panel dataset, but what is the “between estimator”? The manual explains the command, but I cannot figure out what would lead one to choose (or not choose) the between estimator.
Let’s start off by explaining cross-sectional time-series models.
One usually writes cross-sectional time-series models as
y_it = X_it*b + (complicated error term)
and explores different ways of writing the complicated error term.
In this discussion, I want to avoid that, so let’s just write
y_it = X_it*b + noise
I want to put the issue of noise aside and focus on b because, it turns out, b has some surprising meanings. Let’s focus on one of the X variables, say, x1:
y_it = x1_it*b1 + x2_it*b2 + ... + noise
b1 says that an increase in x1 of one unit leads me to expect, in all cases, that y will increase by b1.
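The emphasis here is on “in all cases”: I expect the same difference in y whether I compare two different subjects whose x1 values differ by one unit or I observe a single subject whose x1 increases by one unit.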
Some variables might act like that, but there is no reason to expect that all variables will.
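For example, pretend that y is income and x1 is “lives in the South of the United States”. Then b1 would have to describe both the expected income difference between two people, one who lives in the South and one who does not, and the expected change in income for one person who moves to the South.
There are really two kinds of information in cross-sectional time-series data: the cross-sectional information, reflected in the differences between subjects, and the time-series information, reflected in the changes within subjects.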
xtreg, be estimates using the cross-sectional information in the data.
xtreg, fe estimates using the time-series information in the data.
The random-effects estimator, it turns out, is a matrix-weighted average of those two results. Under the assumption that b1 really does have the same effect in the cross-section as in the time-series—and that b2, b3, ... work the same way—we can pool these two sources of information to get a more efficient estimator.
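The three estimators can be compared side by side. The following is a minimal sketch, not part of the original discussion: the dataset (nlswork, loaded with webuse), the outcome ln_wage, and the regressors south and ttl_exp are illustrative assumptions chosen only because south echoes the “lives in the South” example.

```stata
* Illustrative sketch; dataset and variable choices are assumptions
webuse nlswork, clear
xtset idcode year

* Between estimator: uses the cross-sectional information (person means)
xtreg ln_wage south ttl_exp, be
estimates store be

* Fixed-effects estimator: uses the within-person (time-series) information
xtreg ln_wage south ttl_exp, fe
estimates store fe

* Random-effects estimator: pools the two sources of information
xtreg ln_wage south ttl_exp, re
estimates store re

* Coefficients and standard errors side by side
estimates table be fe re, b(%9.4f) se(%9.4f)
```

In the resulting table, the random-effects coefficients can be read against the between and fixed-effects columns to see how far the cross-sectional and within-person answers differ for each variable.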
Now let’s discuss testing the random-effects assumption.
It is the expected equality of the between and fixed-effects estimators under the equality-of-effects assumption that leads to the application of the Hausman test to test the random-effects assumption. In the Hausman test, one tests that
b_estimated_by_RE == b_estimated_by_FE
As I just said, the random-effects estimate is a (matrix-weighted) average of the other two:
b_estimated_by_RE = Average(b_estimated_by_BE, b_estimated_by_FE)
and so the Hausman test is a test that
Average(b_estimated_by_BE, b_estimated_by_FE) == b_estimated_by_FE
or equivalently, that
b_estimated_by_BE == b_estimated_by_FE
I am being loose with my math here but there is, in fact, literature forming the test in this way and, if you forced me to go back and fill in all the details, we would discover that the Hausman test is asymptotically equivalent to testing
b_estimated_by_BE == b_estimated_by_FE
by more conventional means, but that is not important right now.
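For reference, the conventional Hausman comparison of the fixed-effects and random-effects estimates can be run as sketched below. The dataset and variables are again illustrative assumptions, not taken from the discussion.

```stata
* Illustrative sketch; dataset and variable choices are assumptions
webuse nlswork, clear
xtset idcode year

* Fixed effects: consistent whether or not the random-effects assumption holds
xtreg ln_wage south ttl_exp, fe
estimates store fe

* Random effects: efficient if the random-effects assumption holds
xtreg ln_wage south ttl_exp, re
estimates store re

* Hausman test: name the always-consistent estimator first
hausman fe re
```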
What is important is that the Hausman test is cast in terms of efficiency, whereas thinking about
b_estimated_by_BE == b_estimated_by_FE
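has recast the problem in terms of something real: between person, what is the effect of an increase in x, and within person, what is the effect of an increase in x?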
It may turn out that the answers to those two questions are the same (random effects), and it may turn out that they are different.
More general tests exist, as well.
If they are different, does that really mean the random-effects assumption is invalid? No, it just means that the silly random-effects model that constrains all betas to be equal between person and within person is invalid. Within a random-effects model, there is nothing stopping me from saying that, for b1, the effects are different: I can replace x1 with two variables, avgx1 (the within-person mean of x1) and deltax1 (x1 minus avgx1), and fit the random-effects model on avgx1, deltax1, x2, x3, and so on.
In that model, _b[avgx1] measures the effect within the cross-section, and _b[deltax1] measures the effect within person. The model still constrains each of the other variables to have the same effect within the cross-section and within person (the random-effects assumption).
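A sketch of this single-variable decomposition follows. The names avgsouth and deltasouth play the roles of avgx1 and deltax1. The dataset, the panel identifier idcode, and the choice of south as x1 are assumptions made for illustration.

```stata
* Illustrative sketch; dataset and variable choices are assumptions
webuse nlswork, clear
xtset idcode year

* Drop incomplete observations so person means match the estimation sample
drop if missing(ln_wage, south, ttl_exp)

* Person-level mean of x1 and the deviation from that mean
egen avgsouth = mean(south), by(idcode)
generate deltasouth = south - avgsouth

* Random-effects model in which only x1 has separate between and within effects
xtreg ln_wage avgsouth deltasouth ttl_exp, re
```

Here _b[avgsouth] is the cross-sectional effect and _b[deltasouth] the within-person effect, while ttl_exp is still constrained to a single coefficient.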
In fact, if I make this decomposition for every variable in the model, the coefficients I obtain will equal the coefficients that would be estimated separately by xtreg, be and xtreg, fe.
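The full decomposition can be sketched as below and compared with separate xtreg, be and xtreg, fe fits. The dataset and variables are illustrative assumptions. One caveat worth checking in any given application: in an unbalanced panel such as this one, the coefficients on the person means may match the xtreg, be results only approximately, because the two fits can weight persons differently.

```stata
* Illustrative sketch; dataset and variable choices are assumptions
webuse nlswork, clear
xtset idcode year
drop if missing(ln_wage, south, ttl_exp)

* Build the person mean and the within-person deviation for every regressor
foreach v of varlist south ttl_exp {
    egen avg`v' = mean(`v'), by(idcode)
    generate delta`v' = `v' - avg`v'
}

* Random-effects model with every variable decomposed
xtreg ln_wage avgsouth deltasouth avgttl_exp deltattl_exp, re
estimates store decomp

* Separate between and fixed-effects fits for comparison
xtreg ln_wage south ttl_exp, be
estimates store be
xtreg ln_wage south ttl_exp, fe
estimates store fe

estimates table decomp be fe, b(%9.4f)
```

The avg* rows of the decomp column are to be read against the be column, and the delta* rows against the fe column.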
Moreover, I now have the equivalent of the Hausman specification test, but recast with different words. My test amounts to testing that the cross-sectional effects equal the within-person effects, that is, that _b[avgx1] equals _b[deltax1], that _b[avgx2] equals _b[deltax2], and so on.
Not only do I like these words better, but this test, I believe, is a better test than the Hausman test for testing random effects, because the Hausman test depends more on asymptotics. In particular, the Hausman test depends on the difference between two separately estimated covariance matrices being positive definite, which it must be, asymptotically speaking, under the assumptions of the test. In practice, the difference is sometimes not positive definite, and then we have to discuss how to interpret that result.
In the above test, however, that problem simply cannot arise.
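A sketch of this test, under the same illustrative assumptions about dataset and variables as before: fit the fully decomposed random-effects model and then jointly test that each between coefficient equals its within counterpart.

```stata
* Illustrative sketch; dataset and variable choices are assumptions
webuse nlswork, clear
xtset idcode year
drop if missing(ln_wage, south, ttl_exp)

foreach v of varlist south ttl_exp {
    egen avg`v' = mean(`v'), by(idcode)
    generate delta`v' = `v' - avg`v'
}

xtreg ln_wage avgsouth deltasouth avgttl_exp deltattl_exp, re

* Joint test: cross-sectional effects equal within-person effects
test (avgsouth = deltasouth) (avgttl_exp = deltattl_exp)
```

Because the test is computed from a single fitted model's covariance matrix, no difference of two separately estimated matrices is involved.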
This test has another advantage, too. I do not have to say that the cross-sectional and within-person effects are the same for all variables. I may very well need to include x1=“lives in the South” in my model, knowing that it is an important confounder and knowing that its effects are not the same in the cross-section as in the time-series, when my real interest is in the other variables. What I want to know is whether the data cast doubt on the assumption that the other variables have the same cross-sectional and time-series effects. So, I can test the equality of the between and within coefficients for those other variables only and simply omit x1 from the test.
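Such a partial test can be sketched as follows. The variable south stands in for x1, and ttl_exp and tenure stand in for the other variables. All of these choices, and the dataset, are assumptions made for illustration.

```stata
* Illustrative sketch; dataset and variable choices are assumptions
webuse nlswork, clear
xtset idcode year
drop if missing(ln_wage, south, ttl_exp, tenure)

foreach v of varlist south ttl_exp tenure {
    egen avg`v' = mean(`v'), by(idcode)
    generate delta`v' = `v' - avg`v'
}

* south keeps separate between and within effects in the model
xtreg ln_wage avgsouth deltasouth avgttl_exp deltattl_exp avgtenure deltatenure, re

* Test equality of effects for the other variables only; south is left out
test (avgttl_exp = deltattl_exp) (avgtenure = deltatenure)
```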
Why, then, does Stata include xtreg, be?
One answer is that it is a necessary ingredient in calculating random-effects results: the random-effects results are a weighted average of the xtreg, be and the xtreg, fe results.
Another is that it is important in and of itself if you are willing to think a little differently from most people about cross-sectional time-series models.
xtreg, be answers the question about the effect of x when x changes between person. This can usefully be compared with the results of xtreg, fe, which answers the question about the effect of x when x changes within person.
Thinking about and discussing the between and within models is an alternative to discussing the structure of the residuals. I must say that I lose interest rapidly when researchers report that they can make important predictions about unobservables. My interest is piqued when researchers report something that I can almost feel, touch, and see.