Re: st: about residuals and coefficients
From: David Hoaglin <[email protected]>
To: [email protected]
Date: Wed, 18 Sep 2013 15:04:51 -0400

Dear Sam,
Your comments reflect a number of misunderstandings.
For any set of data, the phrasing "per unit increase" accurately
reflects the underlying mathematics. Thus, it cannot be a disservice.
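As I have mentioned earlier, "other things constant" does not reflect
the way that multiple regression actually works: the fit adjusts for
the contributions of the other predictors in the data rather than
holding them fixed.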
Your example of regressing earnings on age and years of education is
puzzling. In cross-sectional data the comparison would be between
persons 30 years of age with 12 years of schooling and other persons
30 years of age with 13 years of schooling. As a predictor variable,
however, education is usually expressed as a set of categories,
corresponding to the major steps in the education system. The
coefficients for those categories would be differences from the chosen
reference category, adjusted for the contribution of age in the data.
For the model
$ = b_0 + b_1 Yrs Ed + b_2 Age + e
the usual plot would put earnings ($) on the z-axis, Yrs Ed on the x-axis,
and Age on the y-axis, and the fitted equation would describe a plane
(not two planes). You may be intersecting that plane with planes that
are perpendicular to the y-axis and the x-axis, respectively. That
picture does not alter the interpretation of b_1 and b_2.
In the model
$ = b_0 + b_1 Yrs Ed + b_2 Age + b_3 Age^2 + e
the coefficients b_0, b_1, and b_2 have different definitions from the
b_0, b_1, and b_2 in the previous model. The definition of each
coefficient in a multiple regression includes the set of other
predictors in the model. Now b_1 is the slope of Earnings against Yrs
Ed, after adjusting for the contributions of Age and Age^2. Thus, the
interpretation of b_1 in this model differs from the interpretation of
b_1 in the previous model.
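
A small numerical sketch may make the change in definition concrete. The Python below is illustrative only: the eight observations, the formula that generates earnings, and the helper names (solve, ols, residuals) are made up for this example and do not come from the thread, and Age is recorded in decades purely to keep the arithmetic well conditioned. It fits both models, then recovers b_1 of the second model as the slope of adjusted Earnings on adjusted Yrs Ed, where "adjusted" means taking residuals from a regression on the constant, Age, and Age^2.

```python
# Hypothetical data and helper names; nothing here comes from the thread.

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients of y on the given columns (normal equations)."""
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

def residuals(columns, y):
    """What is left of y after removing its least-squares fit on the columns."""
    beta = ols(columns, y)
    return [yi - sum(b * c[i] for b, c in zip(beta, columns))
            for i, yi in enumerate(y)]

# Made-up cross-section: earnings depend on Yrs Ed, Age, and a curved Age effect.
yrs_ed = [12, 13, 14, 16, 18, 14, 12, 12]
age = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]  # decades (assumption)
age_sq = [a * a for a in age]
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.5, 0.1]
earnings = [20 + 2 * ed + 3 * a - 4 * (a - 4.25) ** 2 + u
            for ed, a, u in zip(yrs_ed, age, noise)]
const = [1.0] * len(earnings)

# b_1 in each model is the second coefficient.
b_first = ols([const, yrs_ed, age], earnings)
b_second = ols([const, yrs_ed, age, age_sq], earnings)

# Adjust both Earnings and Yrs Ed for the other predictors of the second model,
# then take the simple slope through the origin of one residual on the other.
others = [const, age, age_sq]
ed_adj = residuals(others, yrs_ed)
earn_adj = residuals(others, earnings)
b1_adjusted = (sum(e * y for e, y in zip(ed_adj, earn_adj))
               / sum(e * e for e in ed_adj))

print("b_1, first model:             ", round(b_first[1], 4))
print("b_1, second model:            ", round(b_second[1], 4))
print("slope of adjusted on adjusted:", round(b1_adjusted, 4))
```

In this sketch the last two printed values should agree up to floating-point error, while the first differs, because Age^2 is related to both Earnings and Yrs Ed in the made-up data.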
The difference between "per unit change" and "per unit difference" is
only semantic. I said "per unit increase" because that is how slopes
are defined. The meaning should always be consistent with the context
of the data.
The geometric representations of those two models are sometimes
useful, but least-squares fitting in a multiple regression involves a
different geometry. If the data consist of n observations, y is a
vector in n-dimensional space, and the fitted regression is the
orthogonal projection of the y-vector onto a subspace. In the first
model that subspace is spanned by the constant vector, the Yrs Ed
vector, and the Age vector; in the second model it is spanned by those
three vectors and the Age^2 vector.
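
The projection view can be checked numerically. The Python below is a minimal sketch under stated assumptions: the eight observations and the helper names (dot, solve, project) are hypothetical, and Age is in decades only to keep the normal equations well conditioned. For each model it projects the earnings vector onto the span of the predictor vectors, then measures how far the residual is from being orthogonal to each spanning vector and how much the fitted vector moves when it is projected a second time.

```python
# Hypothetical data and helper names; nothing here comes from the thread.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def project(columns, y):
    """Orthogonal projection of y onto the span of the columns (normal equations)."""
    gram = [[dot(ci, cj) for cj in columns] for ci in columns]
    coef = solve(gram, [dot(ci, y) for ci in columns])
    return [sum(b * c[i] for b, c in zip(coef, columns)) for i in range(len(y))]

# Made-up cross-section of n = 8 persons.
yrs_ed = [12, 13, 14, 16, 18, 14, 12, 12]
age = [2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]  # decades (assumption)
age_sq = [a * a for a in age]
earnings = [33.0, 41.5, 47.0, 55.5, 60.0, 50.0, 42.5, 38.0]
const = [1.0] * len(earnings)

spans = {
    "first": [const, yrs_ed, age],
    "second": [const, yrs_ed, age, age_sq],
}

max_dot, max_shift, rss = {}, {}, {}
for name, cols in spans.items():
    fitted = project(cols, earnings)
    resid = [y - f for y, f in zip(earnings, fitted)]
    # Largest inner product of the residual with any spanning vector.
    max_dot[name] = max(abs(dot(resid, c)) for c in cols)
    # Projecting a vector already in the subspace should leave it unchanged.
    refit = project(cols, fitted)
    max_shift[name] = max(abs(a - b) for a, b in zip(fitted, refit))
    rss[name] = dot(resid, resid)
    print(name, max_dot[name], max_shift[name], rss[name])
```

Both printed discrepancies should be zero up to floating-point error. Because the second subspace contains the first, the residual sum of squares in the second model cannot exceed that in the first.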
David Hoaglin
On Wed, Sep 18, 2013 at 11:08 AM, Lucas <[email protected]> wrote:
> Dear David,
>
> This is why I do not understand why you prefer the "per unit increase"
> phrasing. Many (probably most) analyses use cross-sectional data.
> Thus, nothing is increasing or decreasing. The coefficients describe
> the relationships, but there is no reason to suspect -- just on the
> basis of cross-sectional data -- that change in an X will lead to the
> slope's change in Y.
>
> For example, if I regress earnings on yrs of education and age, that
> doesn't mean that a 30 year old with 12 years of schooling will be
> expected to increase their earnings by the increment of the slope for
> years of education by going to college for 1 year.
>
> It seems to me of the two potential disservices we can do to students,
> teaching them "per unit increase" is far more misleading than teaching
> them "other things constant" because at least the latter is an
> accurate representation of what the cross-sectional data can allow.
>
> Think about it like this. If my model is:
>
> $ = b_0 + b_1 Yrs Ed + b_2 Age + e
>
> then the model summarizes two planes. The plane for YrsEd has a
> constant slope, i.e., the slope of the plane for Yrs Ed does not vary
> regardless of where you are on the plane for Age. And, vice versa. If
> for theoretical, prior research, or other reasons I estimate:
>
> $ = b_0 + b_1 Yrs Ed + b_2 Age + b_3 Age^2 + e
>
> then the "plane" for Age has become a curved surface which means its
> slope varies for values of Age. Still, the slope for YrsEd is
> constant. So, the interpretation of the YrsEd slope seems unchanged.
> And so on.
>
> Of course, observational data does not usually fix the values of the
> independent variables, and experimenters can (up to a point). But
> there are other ways of addressing this than changing the
> interpretation so that it is either inaccurate or unduly confusing.
>
> Anyway, if we want to be as faithful as possible to what the data can
> say, we should avoid "per unit change" in favor of "per unit
> difference" because for cross-sectional data -- i.e., what is usually
> used -- change is obviously beyond the ability of the data to support.
>
> Other issues (e.g., being on the support vs. extrapolating off the
> support) obviously come in as well.
>
> Sam