From:    Phil Schumm <pschumm@uchicago.edu>
To:      Statalist <statalist@hsphsun2.harvard.edu>
Subject: Re: st: My ANOVA and regression results don't agree
Date:    Mon, 6 Jan 2014 20:10:17 -0600
On Jan 6, 2014, at 4:53 PM, Pepper, Jessica <jkadis@live.unc.edu> wrote:

> Thanks for sending that link. I followed those instructions and got results that made sense. I just have 2 follow-up questions:
>
> 1. I understand that the 2 approaches (ANOVA and regress/test) don't correspond. When I follow the UCLA procedure that you sent the link to, it confirms what I initially found in the ANOVA and also shows me the contrasts, which is what I really need. All that is great. But why, in essence, should I "trust" the ANOVA over the regression? Why are the p values from the regression wrong?

They're not wrong; they're just testing a different hypothesis, given the way the covariates are coded. Consider the following simple example:

. use http://www.stata-press.com/data/r13/systolic if inlist(drug,1,3) ///
      & inlist(disease,1,2)

. anova systolic drug##disease

                         Number of obs =      18     R-squared     =  0.5706
                         Root MSE      = 10.5015     Adj R-squared =  0.4786

                  Source | Partial SS    df       MS           F     Prob > F
            -------------+----------------------------------------------------
                   Model |     2052.05     3   684.016667      6.20     0.0067
                         |
                    drug |  1429.39211     1   1429.39211     12.96     0.0029
                 disease |   178.35117     1    178.35117      1.62     0.2242
            drug#disease |  123.918421     1   123.918421      1.12     0.3071
                         |
                Residual |     1543.95    14   110.282143
            -------------+----------------------------------------------------
                   Total |        3596    17   211.529412

. reg systolic i.drug##i.disease, noheader
------------------------------------------------------------------------------
    systolic |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      3.drug |        -13   7.425703    -1.75   0.102    -28.92655      2.92655
   2.disease |  -1.083333   6.778709    -0.16   0.875    -15.62222     13.45555
             |
drug#disease |
        3 2  |     -10.85   10.23563    -1.06   0.307    -32.80323     11.10323
             |
       _cons |   29.33333   4.287232     6.84   0.000     20.13814     38.52853
------------------------------------------------------------------------------

. test 3.drug + 3.drug#2.disease/2 = 0

 ( 1)  3.drug + .5*3.drug#2.disease = 0

       F(  1,    14) =   12.96
            Prob > F =    0.0029

The main effect of drug in the regression corresponds to the difference between drugs 3 and 1 *among those with disease 1 only*. In contrast, the main effect of drug in the ANOVA corresponds to the overall (or marginal, in the case of balanced data) difference between drugs 3 and 1, which is what people usually mean when they talk about a main effect. Of course, in this case there is no evidence of an interaction between drug and disease, but if there were, the main effects might not be very meaningful.

Note that you can also get the same main effect(s) after -regress- with -contrast-:

. contrast g.drug, noeffects

Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df           F        P>F
-------------+----------------------------------
        drug |
 (1 vs mean) |          1       12.96     0.0029
 (3 vs mean) |          1       12.96     0.0029
       Joint |          1       12.96     0.0029
             |
 Denominator |         14
------------------------------------------------

which is easier if your covariate(s) have many levels.

> 2. The procedure on the UCLA site defaults to treating the highest level of the variable as the reference category. That doesn't matter for my two level variable, but it does for my 3 level variable, correct? And if so, is there an easy way to tell it to treat the lowest level as the reference category? Or should I just manually create a new variable that switches those levels?

IIRC, the UCLA FAQ created the dummy variables manually (with the -generate- option to -tab-).
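To sketch the idea (this is not the FAQ's exact code, and the drugdum* names are just ones I've made up for illustration, using the same systolic data as above):

. use http://www.stata-press.com/data/r13/systolic, clear

. tab drug, generate(drugdum)              // creates drugdum1-drugdum4, one indicator per level of drug

. regress systolic drugdum2 drugdum3 drugdum4   // drugdum1 omitted, so drug 1 is the reference

. regress systolic drugdum1 drugdum2 drugdum3   // drugdum4 omitted, so drug 4 is the reference

Whichever indicator you leave out of the model becomes the reference category, so changing the reference just means including a different set of dummies.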
An easier option (for Stata 11 and higher) is to use factor variables, as I have illustrated in the regression above. These make it easy to change the base category (e.g., using ib3.drug in my example above would have caused Stata to use 3 as the base (drug) category instead of 1). Of course, which category you use as the base doesn't affect the model -- only the way the coefficients are presented in the table. Post-estimation, you can always obtain whatever contrast(s) you want.

> I hope these questions make sense. I am new to Stata and have never encountered a situation where ANOVA and regression don't agree.

That's a misconception -- they do agree. What differs are the coefficients that result from a model matrix representing deviations from the (balanced) grand mean (as used in ANOVA) versus those resulting from dummy variables. As demonstrated above, however, it's easy to switch between these after you've fit the model. The same would be true in R or SAS.


-- Phil


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/