Title:  Interpreting coefficients when interactions are in your model
Author: Kenneth Higbee, StataCorp
I will illustrate what is happening with a simple example using regress. We will explore the hypotheses being tested as we change the base (omitted) level when we have an interaction in a simple two-factor model. For this simple example, each factor has only two levels.
The key conclusion is that, despite what some may believe, the test of a single coefficient in a regression model when interactions are in the model depends on the choice of base levels. Changing from one base to another changes the hypothesis. Furthermore, the hypothesis for a test involving a single regression coefficient is generally not the same as the hypothesis tested by an ANOVA F test of the main effect of a factor. This may be counterintuitive at first glance, but it is true.
Take the following data:
. use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear
. list, sepby(A B)
     +------------+
     |  y   A   B |
     |------------|
  1. | 13   1   1 |
  2. | 34   1   1 |
  3. | 25   1   1 |
  4. | 30   1   1 |
     |------------|
  5. | 28   1   2 |
  6. | 10   1   2 |
  7. | 41   1   2 |
     |------------|
  8. | 11   2   1 |
  9. | 55   2   1 |
     |------------|
 10. | 87   2   2 |
 11. | 25   2   2 |
 12. | 14   2   2 |
 13. | 42   2   2 |
 14. | 89   2   2 |
 15. | 52   2   2 |
 16. | 38   2   2 |
 17. | 45   2   2 |
     +------------+
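Below is a table of the cell means and cell frequencies of y over A and B. One way to obtain such a table (a suggestion on my part; it is not necessarily the command originally used) is

. tabulate A B, summarize(y) means freq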
           |          B
         A |         1          2 |     Total
-----------+----------------------+----------
         1 |      25.5  26.333333 | 25.857143
           |         4          3 |         7
-----------+----------------------+----------
         2 |        33         49 |      45.8
           |         2          8 |        10
-----------+----------------------+----------
     Total |        28  42.818182 | 37.588235
           |         6         11 |        17
We have a 2 × 2 table with unbalanced data—that is, different sample sizes (4, 3, 2, and 8) in each cell. We will refer to the 2 × 2 table above and will compare its values and means to those in other regression tables. These comparisons can help us better understand what hypotheses are being tested.
Let’s start by thinking of the overparameterized design matrix X:
                 A       B          A#B
                1 2     1 2     1  1  2  2
                                1  2  1  2    _cons
               -------------------------------------
                1 0     1 0     1  0  0  0      1
                1 0     1 0     1  0  0  0      1
                1 0     1 0     1  0  0  0      1
                1 0     1 0     1  0  0  0      1
                1 0     0 1     0  1  0  0      1
                1 0     0 1     0  1  0  0      1
                1 0     0 1     0  1  0  0      1
                0 1     1 0     0  0  1  0      1
        X =     0 1     1 0     0  0  1  0      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
                0 1     0 1     0  0  0  1      1
We want to compute regression coefficients b = inv(X'X)*(X'y), but because of the collinearities in X (A1 + A2 = _cons, B1 + B2 = _cons, ...), many of the columns of X must be omitted to have a matrix of full rank that we can invert.
Either the A1 or the A2 column needs to be omitted (or possibly the _cons, but let’s not explore that right now). The column we omit corresponds to what we call the base level for that factor. Likewise for B1 and B2—one of them must be omitted to avoid collinearity with the constant. Of the four columns of X for the A by B interaction, three of them must be omitted (given that we are keeping one of the A columns, one of the B columns, and _cons).
We could choose to omit the first level of both A and B (the A1 and B1 columns of X) and the columns corresponding to A#B that match up with those selections (in this case, the first 3 columns of the part of X for A#B).
. regress y b1.A b1.B A#B
The above command is equivalent to Stata’s default of picking the first level to be the base when you simply type
. regress y i.A i.B A#B
or even more succinctly,
. regress y A##B
With any of the regress commands in this FAQ, you can add the allbaselevels option to get a more verbose regression table that shows exactly which columns of the X matrix were omitted. Once the concept is clear, you may prefer to drop the option, since its output is verbose.
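If you want to see the computation b = inv(X'X)*(X'y) happen explicitly for this choice of base levels, here is a minimal Mata sketch. The dummy-variable names A2, B2, and A2B2 are my own, not anything Stata creates:

. generate byte A2   = (A == 2)
. generate byte B2   = (B == 2)
. generate byte A2B2 = A2*B2
. mata:
:     y = st_data(., "y")
:     // nonbase columns of X, plus a column of ones for _cons
:     X = (st_data(., ("A2", "B2", "A2B2")), J(rows(y), 1, 1))
:     invsym(cross(X, X))*cross(X, y)   // 7.5, .8333333, 15.16667, 25.5
: end

The four values returned match the 2.A, 2.B, 2.A#2.B, and _cons coefficients reported by regress below.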
Instead of choosing A at level 1 and B at level 1 for the base, we could make three other choices for base:
A at level 1, B at level 2
A at level 2, B at level 1
A at level 2, B at level 2
You can get these three other choices with these commands:
. regress y b1.A b2.B A#B
. regress y b2.A b1.B A#B
. regress y b2.A b2.B A#B
Run those four regressions, examine the coefficients, and compare them with the means shown in the table above.
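If you would like to run all four in one pass, a small do-file-style loop (my own convenience wrapper, not part of the original commands) will do it:

foreach a in 1 2 {
    foreach b in 1 2 {
        // base level `a' for A, base level `b' for B
        regress y b`a'.A b`b'.B A#B, noheader
    }
}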
Let’s start with the default base levels. Just to be clear on which columns are dropped from the X matrix we showed above, first type the command:
. regress y b1.A b1.B A#B, allbaselevels
Then for the sake of brevity here, we look at a condensed version of the same regression table.
. regress y b1.A b1.B A#B, noheader
------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         2.A |        7.5   19.72162     0.38   0.710    -35.10597    50.10597
         2.B |   .8333333   17.39283     0.05   0.963     -36.7416    38.40827
             |
         A#B |
        2 2  |   15.16667   25.03256     0.61   0.555     -38.9129    69.24623
             |
       _cons |       25.5   11.38628     2.24   0.043     .9014315    50.09857
------------------------------------------------------------------------------
The _cons coefficient, 25.5, corresponds to the mean of the A1,B1 cell in our 2 × 2 table. In other words, the constant in the regression corresponds to the cell in our 2 × 2 table for our chosen base levels (A at 1 and B at 1).
We get the mean of the A1,B2 cell in our 2 × 2 table, 26.33333, by adding the _cons coefficient to the 2.B coefficient (25.5 + 0.833333).
We get the mean of the A2,B1 cell in our 2 × 2 table, 33, by adding the _cons coefficient to the 2.A coefficient (25.5 + 7.5).
We get the mean of the A2,B2 cell in our 2 × 2 table, 49, by adding the _cons coefficient to the 2.A coefficient, the 2.B coefficient, and the 2.A#2.B coefficient (25.5 + 7.5 + 0.8333 + 15.1667).
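Rather than summing coefficients by hand, you can let Stata recover the cell means. For example (a quick check I am adding here, not part of the original FAQ):

. regress y b1.A b1.B A#B
. margins A#B            // reproduces all four cell means
. lincom _cons + 2.B     // the A1,B2 cell mean, 26.333333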
Let’s focus on the 2.A coefficient, which equals 7.5. What does it correspond to? It corresponds to the A2,B1 cell minus the A1,B1 cell. Looking back at our 2 × 2 table, that would be 33 − 25.5. When you look at the test for that single regression coefficient, you are testing this hypothesis: with B set to 1, is there a difference between level 2 of A and level 1 of A?
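As another quick check (again, my addition), test reproduces this hypothesis as an F test of the single coefficient:

. regress y b1.A b1.B A#B
. test 2.A     // same hypothesis as the t test of the 2.A coefficient, p = 0.710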
Now pick one of the other three regressions that uses a different combination of bases for the two factors. We pick the last one.
Just to be sure you are clear on what has been omitted from the X matrix, type the command:
. regress y b2.A b2.B A#B, allbaselevels
Then for brevity, here is the same regression shown more compactly:
. regress y b2.A b2.B A#B, noheader
------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         1.A |  -22.66667    15.4171    -1.47   0.165    -55.97329    10.63995
         1.B |        -16   18.00329    -0.89   0.390    -54.89375    22.89375
             |
         A#B |
        1 1  |   15.16667   25.03256     0.61   0.555     -38.9129    69.24623
             |
       _cons |         49   8.051318     6.09   0.000     31.60619    66.39381
------------------------------------------------------------------------------
Here the _cons coefficient, 49, equals the mean for the A2,B2 cell of our 2 × 2 table. This corresponds to our choice of level 2 as our base level for both A and B.
We get the mean of the A1,B2 cell, 26.3333, by adding the _cons coefficient to the 1.A coefficient (49 + (-22.6667)).
We get the mean of the A2,B1 cell, 33, by adding the _cons coefficient to the 1.B coefficient (49 + (-16)).
We get the mean of the A1,B1 cell, 25.5, by adding all four of the coefficients (49 + (-22.6667) + (-16) + 15.1667).
Let’s look closely at the 1.A coefficient, which is -22.6667. That coefficient corresponds to the A1,B2 cell minus the A2,B2 cell. From our 2 × 2 table, that would be 26.3333 − 49. When you look at the test for that single regression coefficient, you are testing the hypothesis: with B set to 2, is there a difference between level 1 of A and level 2 of A?
The hypothesis for the test of the 1.A coefficient in this model is not equivalent to the hypothesis for the test of the 2.A coefficient in the previous regression model. They are both testing A, but in the first case it is a test of A with B set to 1. In this second case, it is a test of A with B set to 2.
In the first test, the p-value was 0.710; in the second, it was 0.165. These are very different p-values for this dataset, but that is not shocking because the two tests address different hypotheses.
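If your Stata has the contrast command (version 12 or later), you can see both simple effects side by side in one table; a sketch:

. regress y b1.A b1.B A#B
. contrast A@B     // effect of A at B = 1 (p = 0.710) and at B = 2 (p = 0.165)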
I could illustrate what the coefficients represent in the other two regressions (where we pick other combinations of the levels of A and B to be the base), but I will refrain because it would make a long FAQ even longer.
The ANOVA test of the main effect of A is a different test from both of the coefficient tests shown above.
. anova y A B A#B

                         Number of obs =         17    R-squared     =  0.2330
                         Root MSE      =    22.7726    Adj R-squared =  0.0560
                  Source | Partial SS         df         MS        F    Prob>F
              -----------+----------------------------------------------------
                   Model | 2048.45098          3   682.816993     1.32  0.3112
                         |
                       A | 753.126437          1   753.126437     1.45  0.2496
                       B | 234.505747          1   234.505747     0.45  0.5131
                     A#B | 190.367816          1   190.367816     0.37  0.5550
                         |
                Residual | 6741.66667         13   518.589744
              -----------+----------------------------------------------------
                   Total | 8790.11765         16   549.382353
The test of the main effect of A gives a p-value of 0.2496. You get the same p-value for the main effect of A regardless of whether you type the anova command as shown above or pick different base levels. The following commands all give the same F tests:
. anova y b1.A b1.B A#B
. anova y b1.A b2.B A#B
. anova y b2.A b1.B A#B
. anova y b2.A b2.B A#B
How would you get the ANOVA main-effect F test for term A from the underlying regression coefficients? Take a look at the symbolic option of test after anova.
. quietly anova y A B A#B
. test A, symbolic
  (output omitted)
. test A
                  Source | Partial SS         df         MS        F    Prob>F
              -----------+----------------------------------------------------
                       A | 753.126437          1   753.126437     1.45  0.2496
                         |
                Residual | 6741.66667         13   518.589744
For each of the regressions, we can get the same F test for the main effect of A as shown by the ANOVA above. Type the following commands:
. regress y b1.A b1.B A#B
. test _b[2.A] + 0.5*_b[2.A#2.B] = 0

. regress y b1.A##b2.B
. test _b[2.A] + 0.5*_b[2.A#1.B] = 0

. regress y b2.A##b1.B
. test _b[1.A] + 0.5*_b[1.A#2.B] = 0

. regress y b2.A##b2.B
. test _b[1.A] + 0.5*_b[1.A#1.B] = 0
Run the test A, symbolic command shown above and study its table to see why the tests above are set up the way they are. If you are not sure how I knew to type _b[2.A#2.B] and the like, use the coeflegend option of regress.
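For example, coeflegend replaces the displayed statistics with the _b[] name of each coefficient:

. regress y b1.A b1.B A#B, coeflegend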
I admit that using the linear combination of regression coefficients _b[2.A] + 0.5*_b[2.A#2.B] (picking the first regression as an example) to produce the F test for term A’s main effect is not obvious or intuitive. Let’s look at the algebra when the first levels of A and B are the base levels for our regression:
    2 x 2 cell  =  linear combination of coefficients
    --------------------------------------------------------------
      A1,B1     =  _b[_cons]
      A1,B2     =  _b[_cons] + _b[2.B]
      A2,B1     =  _b[_cons] + _b[2.A]
      A2,B2     =  _b[_cons] + _b[2.A] + _b[2.B] + _b[2.A#2.B]
You find that 0.5*(A2,B1 + A2,B2) − 0.5*(A1,B1 + A1,B2) equals _b[2.A] + 0.5*_b[2.A#2.B].
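Spelling out that arithmetic with the cell-mean expressions above:

    0.5*(A2,B1 + A2,B2) = _b[_cons] + _b[2.A] + 0.5*_b[2.B] + 0.5*_b[2.A#2.B]
    0.5*(A1,B1 + A1,B2) = _b[_cons]           + 0.5*_b[2.B]
    --------------------------------------------------------------------------
    difference          =             _b[2.A]               + 0.5*_b[2.A#2.B]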
The F test in ANOVA for the main effect of A tests the following hypothesis: the average of the cell means when A is 2 minus the average of the cell means when A is 1 equals zero.
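In margins terms, that hypothesis compares the unweighted averages of the cell means. A sketch of how you might see it directly (my addition; the asbalanced option weights each cell equally):

. regress y b1.A b1.B A#B
. margins A, asbalanced      // unweighted averages of the cell means: 25.916667 and 41
. margins r.A, asbalanced    // tests their difference, the same hypothesis as the
                             // ANOVA F test of A (p = 0.2496)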
A similar demonstration could be shown for the other three regression models where other base levels were selected.