



Re: st: My ANOVA and regression results don't agree


From   Phil Schumm <[email protected]>
To   Statalist Statalist <[email protected]>
Subject   Re: st: My ANOVA and regression results don't agree
Date   Mon, 6 Jan 2014 20:10:17 -0600

On Jan 6, 2014, at 4:53 PM, Pepper, Jessica <[email protected]> wrote:
> Thanks for sending that link. I followed those instructions and got results that made sense. I just have 2 follow-up questions:
> 
> 1. I understand that the 2 approaches (ANOVA and regress/test) don't correspond. When I follow the UCLA procedure that you sent the link to, it confirms what I initially found in the ANOVA and also shows me the contrasts, which is what I really need. All that is great. But why, in essence, should I "trust" the ANOVA over the regression? Why are the p values from the regression wrong?


They're not wrong -- they're just testing a different hypothesis, given the way the covariates are coded.  Consider the following simple example:


    . use http://www.stata-press.com/data/r13/systolic if inlist(drug,1,3) ///
        & inlist(disease,1,2)

    . anova systolic drug##disease

                       Number of obs =      18     R-squared     =  0.5706
                       Root MSE      = 10.5015     Adj R-squared =  0.4786

              Source |  Partial SS    df       MS           F     Prob > F
        -------------+----------------------------------------------------
               Model |     2052.05     3  684.016667       6.20     0.0067
                     |
                drug |  1429.39211     1  1429.39211      12.96     0.0029
             disease |   178.35117     1   178.35117       1.62     0.2242
        drug#disease |  123.918421     1  123.918421       1.12     0.3071
                     |
            Residual |     1543.95    14  110.282143   
        -------------+----------------------------------------------------
               Total |        3596    17  211.529412   

    . reg systolic i.drug##i.disease, noheader
    ------------------------------------------------------------------------------
        systolic |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          3.drug |        -13   7.425703    -1.75   0.102    -28.92655     2.92655
       2.disease |  -1.083333   6.778709    -0.16   0.875    -15.62222    13.45555
                 |
    drug#disease |
            3 2  |     -10.85   10.23563    -1.06   0.307    -32.80323    11.10323
                 |
           _cons |   29.33333   4.287232     6.84   0.000     20.13814    38.52853
    ------------------------------------------------------------------------------

    . test 3.drug + 3.drug#2.disease/2 = 0

     ( 1)  3.drug + .5*3.drug#2.disease = 0

           F(  1,    14) =   12.96
                Prob > F =    0.0029


The main effect of drug in the regression corresponds to the difference between drugs 3 and 1 *among those with disease 1 only*.  In contrast, the main effect of drug in the ANOVA corresponds to the overall (or, in the case of balanced data, marginal) difference between drugs 3 and 1.  This is what people usually mean when they talk about a main effect.  Of course, in this case there is no evidence of an interaction between drug and disease; if there were, the main effects might not be very meaningful.

Note that you can also get the same main effect(s) after -regress- with -contrast-:


    . contrast g.drug, noeffects

    Contrasts of marginal linear predictions

    Margins      : asbalanced

    ------------------------------------------------
                 |         df           F        P>F
    -------------+----------------------------------
            drug |
    (1 vs mean)  |          1       12.96     0.0029
    (3 vs mean)  |          1       12.96     0.0029
          Joint  |          1       12.96     0.0029
                 |
     Denominator |         14
    ------------------------------------------------


which is easier if your covariate(s) have many levels.


> 2. The procedure on the UCLA site defaults to treating the highest level of the variable as the reference category. That doesn't matter for my two level variable, but it does for my 3 level variable, correct? And if so, is there an easy way to tell it to treat the lowest level as the reference category? Or should I just manually create a new variable that switches those levels. 


IIRC, the UCLA FAQ created the dummy variables manually (with the -generate- option to -tab-).  An easier option (for Stata 11 and higher) is to use factor variables, as I have illustrated in the regression above.  These make it easy to change the base category (e.g., using ib3.drug in my example above would have caused Stata to use 3 as the base (drug) category instead of 1).  Of course, which category you use as the base doesn't affect the model -- only the way the coefficients are presented in the table.  Post-estimation, you can always obtain whatever contrast(s) you want.
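For instance (a sketch using the same data as above), re-fitting the model with drug 3 as the base category would look like this:

    . regress systolic ib3.drug##i.disease, noheader

The 1.drug coefficient would then be the drug 1 vs drug 3 difference among those with disease 1 (i.e., the negative of the 3.drug coefficient above), and the overall fit (R-squared, residual SS) would be unchanged.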


> I hope these questions make sense. I am new to Stata and have never encountered a situation where ANOVA and regression don't agree.


That's a misconception -- they do agree.  What differs are the coefficients that result from a model matrix representing deviations from the (balanced) grand mean (as used in ANOVA) versus those resulting from dummy variables.  As demonstrated above, however, it's easy to switch between the two parameterizations after you've fit the model.  The same is true in R or SAS.
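Incidentally (IIRC), the correspondence runs in the other direction too: after fitting the model with -anova-, typing -regress- with no arguments replays the same fit as a dummy-coded coefficient table, e.g.:

    . anova systolic drug##disease
    . regress

after which you can use -test- or -contrast- just as you would following -regress- directly.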


-- Phil



