Re: Re-re-post: Stata 11 - Factor variables in a regression command
From: Michael Norman Mitchell <[email protected]>
Subject: Re: Re-re-post: Stata 11 - Factor variables in a regression command
Date: Sat, 01 May 2010 10:31:05 -0700
Greetings
Richard Williams wrote...
--- snip ---
As the original example shows, the fits produced by the first two
syntaxes are identical.
--- snip ---
I completely agree with Richard that
. logistic y a#b
and
. logistic y a##b
are two different ways of parameterizing a model with two categorical predictors. If we let factor a have A levels, and factor b have B levels, then both models will have
(A-1) + (B-1) + (A-1)*(B-1)
parameters. In fact, this shows how the traditional parameterization (i.a i.b a#b) decomposes those parameters into "main effect of a" (A-1 df), "main effect of b" (B-1 df), and "a by b interaction" ((A-1)*(B-1) df).
If, instead, one specifies -a#b- alone, this single term has (A-1) + (B-1) + (A-1)*(B-1) degrees of freedom, and it is no longer partitioned into a main effect of a, a main effect of b, and an a by b interaction. The omnibus test of this effect is the overall test of the null hypothesis that there is simultaneously no main effect of a, no main effect of b, and no a by b interaction. As I show below, it simply tests the equality of the means in all of the cells. I think this is rarely of research interest when one has this kind of "factorial" layout.
So, if this is what the omnibus test is doing, what about the individual parameters? Looking at Ricardo's initial example:
----------------------------------------------------------------------------
y | Odds Ratio Std. Err. z P>|z| [95% Conf. Int.]
-----------+----------------------------------------------------------------
a#b |
0 1 | 1.567419 .2804138 2.51 0.012 1.1038 2.2256
1 0 | 1.447424 .2588797 2.07 0.039 1.0194 2.0551
1 1 | 1.211988 .2246236 1.04 0.300 .84283 1.7428
----------------------------------------------------------------------------
Note how this is much like a "oneway" layout of the data, where there are four groups and one of them is omitted (the group a=0 b=0). So, each of these parameters tests whether that "cell" differs from the omitted cell. That is, the first parameter tests whether the cell labeled a=0 b=1 differs from the cell a=0 b=0. It is as though the design had been converted into four groups (labeled 1 2 3 4, with group 1, corresponding to a=0 b=0, as the omitted group). The tests then compare group 2 vs. 1, group 3 vs. 1, and group 4 vs. 1. The omnibus test of all the parameters, as noted above, tests the equality of all of the cell means.
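One could make that conversion explicit. A sketch, where the variable name "cell" is made up:
. egen cell = group(a b)     // 1 = (a=0,b=0), ..., 4 = (a=1,b=1)
. logistic y i.cell          // same fit; groups 2, 3, 4 each compared to group 1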
Returning to Richard's point: as he notes, this is just an alternative parameterization of the original model, one where each cell is compared to a reference cell. If this is the series of comparisons a researcher wants to make, it is a very useful parameterization.
I hope that is useful to Ricardo, and any other readers,
Best regards,
Michael
Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com
On 2010-05-01 8:50 AM, Richard Williams wrote:
At 01:42 AM 5/1/2010, Michael Norman Mitchell wrote:
Dear Ricardo
The command
. logistic y a#b
includes just the "a by b" interaction, and includes neither the main effect of a nor the main effect of b. By contrast, the
command
. logistic y a##b
includes the main effect of a, the main effect of b, as well as
the a by b interaction. It is equivalent to typing
. logistic y a#b a b
I don't think this is quite right. As the original example shows, the fits produced by the first two syntaxes are identical. So, a#b and a##b are different ways of parameterizing the same model. a##b gives you the main effect of a, the main effect of b, and the interaction; i.e., it is the same as entering a, b, and a*b in the model, where a*b = 1 if a and b both equal 1, and 0 otherwise. I believe this is equivalent to your 3rd syntax, except I would say i.a and i.b so Stata knows these are categorical variables.
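For instance, that equivalence could be verified by building the product term by hand. A sketch using the variables from the example below, where the name "yr89Xmale" is made up:
. gen byte yr89Xmale = yr89*male
. logit warmlt2 i.yr89 i.male yr89Xmale, nolog     // should reproduce the a##b fit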
With a#b, there are four possible combinations of values: 0 0, 0 1, 1
0, and 1 1. The first gets dropped and the other three are in the
model.
These are two parameterizations of the same model; personally I
prefer the a##b approach because it separates main effects from
interaction effects.
The following example illustrates the 3 different approaches, and
shows the equivalence of the last 2 approaches in Michael's example:
. use "http://www.indiana.edu/~jslsoc/stata/spex_data/ordwarm2.dta",
clear
(77 & 89 General Social Survey)
. logit warmlt2 yr89#male, nolog

Logistic regression                               Number of obs   =       2293
                                                  LR chi2(3)      =      64.74
                                                  Prob > chi2     =     0.0000
Log likelihood = -851.54241                       Pseudo R2       =     0.0366

------------------------------------------------------------------------------
     warmlt2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   yr89#male |
        0 1  |   .1816812   .1431068     1.27   0.204     -.098803    .4621655
        1 0  |  -1.295833    .229115    -5.66   0.000     -1.74489   -.8467762
        1 1  |   -.659902   .2022755    -3.26   0.001    -1.056355   -.2634493
             |
       _cons |  -1.667376   .1021154   -16.33   0.000    -1.867518   -1.467233
------------------------------------------------------------------------------
. logit warmlt2 yr89##male, nolog

Logistic regression                               Number of obs   =       2293
                                                  LR chi2(3)      =      64.74
                                                  Prob > chi2     =     0.0000
Log likelihood = -851.54241                       Pseudo R2       =     0.0366

------------------------------------------------------------------------------
     warmlt2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.yr89 |  -1.295833    .229115    -5.66   0.000     -1.74489   -.8467762
      1.male |   .1816812   .1431068     1.27   0.204     -.098803    .4621655
             |
   yr89#male |
        1 1  |   .4542502   .3050139     1.49   0.136    -.1435661    1.052066
             |
       _cons |  -1.667376   .1021154   -16.33   0.000    -1.867518   -1.467233
------------------------------------------------------------------------------
. logit warmlt2 i.yr89 i.male yr89#male, nolog

Logistic regression                               Number of obs   =       2293
                                                  LR chi2(3)      =      64.74
                                                  Prob > chi2     =     0.0000
Log likelihood = -851.54241                       Pseudo R2       =     0.0366

------------------------------------------------------------------------------
     warmlt2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.yr89 |  -1.295833    .229115    -5.66   0.000     -1.74489   -.8467762
      1.male |   .1816812   .1431068     1.27   0.204     -.098803    .4621655
             |
   yr89#male |
        1 1  |   .4542502   .3050139     1.49   0.136    -.1435661    1.052066
             |
       _cons |  -1.667376   .1021154   -16.33   0.000    -1.867518   -1.467233
------------------------------------------------------------------------------
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME: (574)289-5227
EMAIL: [email protected]
WWW: http://www.nd.edu/~rwilliam
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*