Re: st: on-line resources for multiple partial F-test?

From	[email protected] (William Gould, Stata)
To	[email protected]
Subject	Re: st: on-line resources for multiple partial F-test?
Date	Mon, 30 Jan 2006 11:24:28 -0600
Doug Mounce <[email protected]> asked about linear regression
with binary independent variables, 

> Any recommendations for a good on-line description and explanation
> for applying the multiple partial F-test?  Also, what's the general
> way to think about using regress when the Y variable is continuous,
> but the X is binary?
>
> We've been learning in class how to describe the regress of one  
> continuous variable on another, and I understand how to do a log  
> transform and look for curvature or heteroscedasticity.  We do  
> jackknife residuals and some other diagnostics, and I'm stumped on  
> how to describe the regress when the independent variable is binary.   
> I can talk about the regression coefficient, I guess, but the R^2  
> doesn't account-for much variation in this data.

Doug also said, "Sorry if asking for help with homework is bad form, ...".
It is never bad form to ask for understanding, so I'm going to rattle
on a bit.

One important thing to learn is that any table of means can be 
reproduced as a linear regression.  For instance, consider the 
following (oneway) table

                         avg.
         Sex      |  blood pressure
         ---------+-----------------
         Male     |       140
         Female   |       138

Are male and female avg. blood pressure different?  Linear regression can 
answer that question, and I will show how.

Or consider the following (twoway) table

                      untreated           treated
                         avg.              avg.
         Sex      |  blood pressure  blood pressure
         ---------+--------------------------------
         Male     |       170              160
         Female   |       165              158

Are the effects of treatment different for males and females?  Linear 
regression can answer that question. 

Or consider the following (many-way) table:

        presence of   | untreated    treatment 1    treatment 2
        complications |  avg bp        avg bp          avg bp
        --------------+----------------------------------------
        Males 40-50:  |
          Without     |    152       etc.
          With        |    167
        Males 51-60:  |
          Without     |    etc.
          With        |
        Males 61-70:  |
          Without     |
          With        |
                      |
        Females 40-50:|
          Without     |
          With        |
        Females 51-60:|
          Without     |
          With        |
        Females 61-70:|
          Without     |
          With        |
        -------------------------------------------------------

There are lots of questions we could ask (and answer) with linear 
regression.  Do complications matter?  If they do, do they matter 
for females?  Does age matter?  Maybe it only matters when there are 
complications?  Are treatment 1 and treatment 2 equivalent among males?
How about all of the above?

With linear regression, we can 

    1.  Fill in the tables.

    2.  Answer any of the above questions, and more.

    3.  Fill in constrained tables, tables that use the data efficiently
        under some assumption, such as that age only matters when 
        there are complications, and that complications only matter 
        among males, all at the same time.

In fact, once you get into this, I predict you will find that
linear-regression results are easier to understand and to interpret than the
tables, although you will still want to fill in the tables when presenting
results to nonprofessionals.  Everybody knows how to read a table.

In any case, there is a one-to-one correspondence between tables and
linear regressions.  They are different ways of writing down the
same thing.

Consider the following linear regression:

        bp = b0 + b1*female + noise

where bp is blood pressure, female is a variable that is 1 if the subject is
female and 0 otherwise, and b0 and b1 are coefficients to be estimated.
An important feature of linear regression is its relationship to means.

        E(bp) = E(b0 + b1*female + noise)
              = E(b0) + E(b1*female) + E(noise)
              = b0 + b1*E(female)        (b0, b1 are constants; E(noise)=0)

E() is the expectation operator, and E(bp) means (is defined to be) the 
mean of blood pressure.  E(female) means (is defined to be) the mean
value of variable female, which is the proportion female in the population.

Let's take the equation 

        E(bp) = b0 + b1*E(female)

and use it to answer some questions.

What is the average blood pressure for males?  Answer:  In the case of males,
variable female is always equal to 0.  Therefore, the mean value of variable
female among males is 0.  Therefore, the average blood pressure is

        E(bp|male) = b0

What is the average blood pressure of females?  Answer:  In the case of 
females, variable female is always equal to 1.  Therefore, the mean value 
of variable female among females is 1.  Therefore, the average blood pressure
is

        E(bp|female) = b0 + b1

To wit, coefficient b1 measures the difference between female and male 
average blood pressures.

Are male and female average blood pressures the same?  That is merely a 
question of whether coefficient b1 is 0.  We can read that off the 
t-statistic reported by the regression, or we can type 

        . regress bp female
        . test female==0
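
A minimal sketch of the table-versus-regression correspondence, meant to be
run as a do-file.  The six observations are made up for illustration (they
are not from the original post) and are chosen so that the group means
match the 140 and 138 of the one-way table above:

```stata
* Hypothetical toy data: 3 males (mean 140) and 3 females (mean 138)
clear
input bp female
138 0
140 0
142 0
135 1
138 1
141 1
end

* The table of means by sex
tabulate female, summarize(bp) means

* The same information written as a regression
regress bp female

* b0 is the male mean; b0 + b1 is the female mean
display "male mean   = " _b[_cons]
display "female mean = " _b[_cons]+_b[female]

* Are the two means different?  Same test as the t-statistic on female
test female == 0
```

With only two groups, the F statistic reported by -test- is the square of
the t-statistic on female, so the two p-values agree.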

Now let's add a treatment effect, 

        . regress bp female treatment

which runs the regression

        bp = b0 + b1*female + b2*treatment + noise 

Playing the same expected-value (i.e., mean) game, we get the following:

                      untreated           treated
                         avg.              avg.
         Sex      |  blood pressure  blood pressure
         ---------+--------------------------------
         Male     |       b0               b0+b2
         Female   |     b0+b1            b0+b1+b2

If we look closely at the above table, we will see that we constrained the
effect of treatment to be the same for males and females.  Subtract untreated
from treated in each row:

         males:   b0+b2 - b0          = b2
       females:   b0+b1+b2 - (b0+b1)  = b2
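
To see the constraint at work, here is a sketch on made-up data (two
hypothetical patients per cell, with raw cell means equal to the 170, 160,
165, and 158 of the two-way table above).  The raw treatment effects differ
by sex, but the fitted cell means from the additive model are forced to
share a single treatment effect, b2:

```stata
* Hypothetical toy data: two patients per sex-by-treatment cell
clear
input bp female treatment
168 0 0
172 0 0
157 0 1
163 0 1
163 1 0
167 1 0
155 1 1
161 1 1
end

* Raw cell means: treatment effect is -10 for males, -7 for females
tabulate female treatment, summarize(bp) means

* Additive model: one treatment effect for both sexes
regress bp female treatment

* Fitted cell means implied by the regression
predict double fitted, xb
tabulate female treatment, summarize(fitted) means

* Treated minus untreated is _b[treatment] in both rows of the fitted table
display "common treatment effect = " _b[treatment]
```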

If we want to run a regression where the female treatment effect can be
different from the male treatment effect, we need to add a coefficient for
it:

        . gen treatedfem = treatment*female
        . regress bp female treatment treatedfem

which estimates the regression, 

        bp = b0 + b1*female + b2*treatment + b3*female*treatment + noise 

and our mean table now is, 

                      untreated           treated
                         avg.              avg.
         Sex      |  blood pressure  blood pressure
         ---------+--------------------------------
         Male     |       b0               b0+b2
         Female   |     b0+b1            b0+b1+b2+b3

Let's use these results to answer the following questions:

    (1) Is the effect of treatment the same for males and females?  That is
        just a question of whether b3==0.

    (2) Does treatment have an effect among males?  It does not if b2==0.

    (3) Does treatment have an effect among females?  It does not if b2+b3==0.

    (4) Does treatment in general have an effect?  It does not if b2==0 AND
        b2+b3==0.  That turns out to have the same answer as b2==0 AND b3==0,
        which is interesting, but not substantively important.

Let's answer the four questions:

        . test treatedfem==0                        (1)

        . test treatment==0                         (2)

        . test treatment+treatedfem==0              (3)

        . test treatment==0                    
        . test treatment+treatedfem==0, accum       (4)

This last test is the multiple partial F test about which Doug asked.
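
The four tests can be tried end to end with the sketch below, written as a
do-file.  The data are made up for illustration (two hypothetical patients
per cell); only the commands matter.  The last two commands are additions
not mentioned above: -test- with a list of variables is a shorthand for the
joint test in (4), and -lincom- reports the point estimate and confidence
interval for the female treatment effect, b2+b3, rather than just a test:

```stata
* Hypothetical toy data: two patients per sex-by-treatment cell
clear
input bp female treatment
168 0 0
172 0 0
157 0 1
163 0 1
163 1 0
167 1 0
155 1 1
161 1 1
end

generate treatedfem = treatment*female
regress bp female treatment treatedfem

* (1) Is the treatment effect the same for males and females?
test treatedfem == 0

* (2) Does treatment have an effect among males?
test treatment == 0

* (3) Does treatment have an effect among females?
test treatment + treatedfem == 0

* (4) Does treatment have any effect at all?  Joint test of both restrictions
test treatment == 0
test treatment + treatedfem == 0, accumulate

* Same joint hypothesis written as b2==0 AND b3==0
test treatment treatedfem

* Estimate and CI for the female treatment effect, b2 + b3
lincom treatment + treatedfem
```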

Doug needs to put together some other tables in linear-regression form on his
own.  Eventually, he'll get to the point where he can write down the
linear regression without thinking.


What about R^2?
---------------

Doug asked about R^2.  He mentioned that the R^2 did not account for much of
the variation in his data.

Who cares?

The R^2 is just a reflection of the variance of the noise term.
Let's go back to our simplest regression:

        bp = b0 + b1*female + noise

If the R^2 is small, then that means the blood pressure of individual
patients exhibits substantial variation about the mean, but that does not
invalidate the mean.  Nor does it invalidate the tests performed on the mean.
They all take into account the variation.  More variation means that you will
need more data to uncover an effect, assuming it is present.

More variation means that, if you don't find an effect, that may be due to
insufficient data, but in linear regression, that's pretty easy to detect.
Are males the same as females?  In the above linear regression, that
translates to whether b1==0.  Say you cannot reject that b1 is equal to 0.
Look at your regression output.  Look at the 95% confidence interval reported
for coefficient b1.  The 95% C.I. will include 0, but I don't care about that.
Look at the lower and upper bounds.  Are they near 0, or are they large?  That
answers the ignorance question.  If the 95% confidence interval is [-48,52]
and we're talking about blood pressure, then you didn't measure the b1 effect
very precisely.  You really don't know whether b1 is 0.  If the 95% 
confidence interval is [-4, 3], then I'd say you've pretty well established 
that the difference between males and females is small.

That's one of the best features of regression.  Look at the coefficients 
and their CIs, and you can see what you measured and how well you measured 
it.  That's much more informative than just reporting a test result.
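
The confidence interval is already in the -regress- output, but as a sketch
of where it comes from, it can be rebuilt from the stored results.  The data
below are made up for illustration, and the scalar names tcrit, lo, and hi
are arbitrary:

```stata
* Hypothetical toy data: 3 males and 3 females
clear
input bp female
138 0
140 0
142 0
135 1
138 1
141 1
end

regress bp female

* 95% CI for b1 by hand: estimate +/- t critical value * standard error
scalar tcrit = invttail(e(df_r), 0.025)
scalar lo = _b[female] - tcrit*_se[female]
scalar hi = _b[female] + tcrit*_se[female]

display "b1     = " _b[female]
display "95% CI = [" lo ", " hi "]"
```

The width of [lo, hi], judged on the scale of the outcome, is what tells
you whether a nonsignificant result means "small effect" or "too little
data".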

     ASIDE:  We were once interviewing a young graduate at StataCorp.
     He was evaluating the efficacy of remote versus in-class teaching.
     He had done a small survey.  He reported ANOVA results -- linear 
     regression, but without the coefficients.  The effect of remote 
     learning was insignificant at the 95% level, he reported gleefully.
     "What was the point estimate?" we asked, "what was the CI?"  He 
     didn't know.  Remember, his survey had few observations.  Therefore,
     we pointed out, you really don't know whether remote teaching is 
     equally effective, do you?  Maybe, we said, your survey was too small
     to cast light on the subject.  In fact, we continued to press, 
     a back-of-the-envelope calculation suggests that is precisely
     the case.  Devastating.  More devastating for his thesis advisor.


What about heteroscedasticity?
------------------------------

The equivalent of heteroscedasticity when all RHS variables are 0 or 1 is
heterogeneous variances for different groups.  When we estimate

        bp = b0 + b1*female + noise

we assume that E(noise^2)==constant, which is to say, the variance in 
blood pressure is the same for males and females.  What if it isn't?

Fact is, mean-equality tests are pretty robust to a violation of this 
assumption, so I'm pretty calm about it.  If you have a p-value of 
.001, variance inequality is not going to make the difference vanish.

When variances are different, you are using the data inefficiently.  Some
observations are more informative than others, and you can exploit that to get
better estimates, which can make the improved test move either way.  You can
estimate the variances for the subgroups by retrieving the residuals and
squaring them, and then you can use those estimates of variance to weight the
data.  I won't go into that here.  Point is, if you have to do that to get
your result, I'm pretty suspicious.  Likewise, if you do that and your result
vanishes, I'm equally suspicious.  Fact is, it rarely happens.
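
For readers who want to try it anyway, here is a rough sketch of the
weighting idea on made-up data in which males vary more than females.  It
is one possible reading of the steps described above, not a recommended
procedure: the variable names e, e2, and v are arbitrary, and the group
variances are estimated crudely as the mean squared residual within each
sex.  The last two commands are common alternatives not discussed above
(robust standard errors and the unequal-variance t test):

```stata
* Hypothetical toy data: males are more variable than females
clear
input bp female
130 0
140 0
150 0
136 1
138 1
140 1
end

* Step 1: ordinary regression, then retrieve and square the residuals
regress bp female
predict double e, residuals
generate double e2 = e^2

* Step 2: estimate the noise variance separately for each group
egen double v = mean(e2), by(female)

* Step 3: refit, weighting each observation by the inverse of its variance
regress bp female [aweight = 1/v]

* Alternatives that do not require estimating weights
regress bp female, vce(robust)
ttest bp, by(female) unequal
```

In this two-group case the weights are constant within each group, so the
weighted coefficients equal the unweighted ones (they are still the group
means); only the standard errors, and hence the test, change.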

-- Bill
[email protected]