Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: My ANOVA and regression results don't agree
From: Phil Schumm <[email protected]>
To: Statalist Statalist <[email protected]>
Subject: Re: st: My ANOVA and regression results don't agree
Date: Mon, 6 Jan 2014 20:10:17 -0600
On Jan 6, 2014, at 4:53 PM, Pepper, Jessica <[email protected]> wrote:
> Thanks for sending that link. I followed those instructions and got results that made sense. I just have 2 follow-up questions:
>
> 1. I understand that the 2 approaches (ANOVA and regress/test) don't correspond. When I follow the UCLA procedure that you sent the link to, it confirms what I initially found in the ANOVA and also shows me the contrasts, which is what I really need. All that is great. But why, in essence, should I "trust" the ANOVA over the regression? Why are the p values from the regression wrong?
They're not wrong; they're just testing a different hypothesis because of the way the factors are coded. Consider the following simple example:
. use if inlist(drug,1,3) & inlist(disease,1,2) using http://www.stata-press.com/data/r13/systolic
. anova systolic drug##disease
                           Number of obs =      18     R-squared     =  0.5706
                           Root MSE      = 10.5015     Adj R-squared =  0.4786

      Source |  Partial SS    df       MS           F     Prob > F
-------------+----------------------------------------------------
       Model |     2052.05     3  684.016667       6.20     0.0067
             |
        drug |  1429.39211     1  1429.39211      12.96     0.0029
     disease |   178.35117     1   178.35117       1.62     0.2242
drug#disease |  123.918421     1  123.918421       1.12     0.3071
             |
    Residual |     1543.95    14  110.282143
-------------+----------------------------------------------------
       Total |        3596    17  211.529412
. reg systolic i.drug##i.disease, noheader
------------------------------------------------------------------------------
    systolic |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      3.drug |        -13   7.425703    -1.75   0.102    -28.92655     2.92655
   2.disease |  -1.083333   6.778709    -0.16   0.875    -15.62222    13.45555
             |
drug#disease |
        3 2  |     -10.85   10.23563    -1.06   0.307    -32.80323    11.10323
             |
       _cons |   29.33333   4.287232     6.84   0.000     20.13814    38.52853
------------------------------------------------------------------------------
. test 3.drug + 3.drug#2.disease/2 = 0
( 1) 3.drug + .5*3.drug#2.disease = 0
F( 1, 14) = 12.96
Prob > F = 0.0029
The coefficient on 3.drug in the regression is the difference between drugs 3 and 1 *among those with disease 1 only* (the base level of disease). In contrast, the main effect of drug in the ANOVA is the difference between drugs 3 and 1 averaged equally over the two disease levels (the marginal difference, if the data are balanced), which is what the -test- command above reproduces. This is what people usually mean when they talk about a main effect. Of course, in this case there is no evidence of an interaction between drug and disease; if there were, the main effects might not be very meaningful.
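To see the two hypotheses side by side as estimates rather than F tests, something like the following do-file sketch should work. It assumes the same subset of the systolic data and the same factor-variable regression as above; only the use of -lincom- is new here.

```stata
* Load the same 18-observation subset used above
use if inlist(drug,1,3) & inlist(disease,1,2) ///
    using http://www.stata-press.com/data/r13/systolic, clear

* Fit the dummy-coded model (base levels: drug 1, disease 1)
regress systolic i.drug##i.disease, noheader

* Drug 3 vs 1 among disease 1 only: this is just the 3.drug coefficient
lincom 3.drug

* Drug 3 vs 1 averaged equally over disease 1 and 2: the ANOVA main effect
lincom 3.drug + 0.5*3.drug#2.disease
```

From the coefficient table above, the second estimate should be -13 + 0.5*(-10.85) = -18.425, and the square of its t statistic should match the F of 12.96 reported by -anova- and -test-.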
Note that you can also get the same main effect(s) after -regress- with -contrast-:
. contrast g.drug, noeffects
Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
        drug |
(1 vs mean)  |          1       12.96     0.0029
(3 vs mean)  |          1       12.96     0.0029
      Joint  |          1       12.96     0.0029
             |
 Denominator |         14
------------------------------------------------
which is easier if your covariate(s) have many levels.
> 2. The procedure on the UCLA site defaults to treating the highest level of the variable as the reference category. That doesn't matter for my two level variable, but it does for my 3 level variable, correct? And if so, is there an easy way to tell it to treat the lowest level as the reference category? Or should I just manually create a new variable that switches those levels.
IIRC, the UCLA FAQ created the dummy variables manually (with the -generate- option to -tab-). An easier option (for Stata 11 and higher) is to use factor variables, as I have illustrated in the regression above. These make it easy to change the base category (e.g., using ib3.drug in my example above would have caused Stata to use 3 as the base (drug) category instead of 1). Of course, which category you use as the base doesn't affect the model -- only the way the coefficients are presented in the table. Post-estimation, you can always obtain whatever contrast(s) you want.
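As a rough sketch of changing the base category (again assuming the same subset of the systolic data), the following do-file fits the model with each base level of drug and then asks for the same main-effect test after each fit:

```stata
* Load the same 18-observation subset used above
use if inlist(drug,1,3) & inlist(disease,1,2) ///
    using http://www.stata-press.com/data/r13/systolic, clear

* Default: the lowest level (drug 1) is the base category
regress systolic i.drug##i.disease, noheader
contrast drug

* Same model, but with drug 3 as the base category for this command
regress systolic ib3.drug##i.disease, noheader
contrast drug

* To change the base for all later commands instead, one could use:
* fvset base 3 drug
```

The coefficient tables differ between the two fits, but the -contrast- results should be identical, since the base category changes only the parameterization and not the model.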
> I hope these questions make sense. I am new to Stata and have never encountered a situation where ANOVA and regression don't agree.
That's a misconception -- they do agree. What differs are the coefficients: those from a model matrix representing deviations from the (balanced) grand mean (as used in ANOVA) versus those from dummy variables. As demonstrated above, however, it's easy to switch between these after you've fit the model. The same would be true in R or SAS.
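If it helps to see the deviation coding directly, here is a minimal sketch that builds the +1/-1 variables by hand. The variable names d and s are made up for illustration and are not part of the dataset.

```stata
* Load the same 18-observation subset used above
use if inlist(drug,1,3) & inlist(disease,1,2) ///
    using http://www.stata-press.com/data/r13/systolic, clear

* Hypothetical deviation-coded (+1/-1) versions of the two factors
generate d = cond(drug == 3, 1, -1)
generate s = cond(disease == 2, 1, -1)

* Same model as before, different parameterization
regress systolic c.d##c.s, noheader

* With this coding, the test of d is the ANOVA main effect of drug
test d
```

With this coding the coefficient on d should be half the averaged drug difference (-18.425/2 = -9.2125), and -test d- should give the same F of 12.96 as the drug line in the ANOVA table.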
-- Phil