Suppose we have two discrete variables (y and x), each with three
categories (coded 0, 1 and 2). This results in a 3x3 contingency table
with m=9 different cells. Now, it seems to me that I have two
possibilities to incorporate x and y in a regression-like analysis:
1) Use four dummy variables, two for x and two for y (additionally, we can
use interaction terms)
2) Use one dummy for each of m-1=8 cells of the contingency table.
My inclination would be to go with option 1. The problem with option 2 is
that it potentially confounds the effects of variables. Suppose, for
example, that the variables are race and religion, and that religion has
significant effects but race does not. Approach 1 can pick that up but in
approach 2 the effects of race and religion get muddled together. Or,
suppose that the main effects of race and religion are significant but the
interaction effects are not. With approach 1, you can run tests that will
show you the interaction effects should not be in there, but with approach
2, interaction and main effects again get muddled together. Even if all
effects are significant, approach 1 is more informative in that it
separates out the main effects and the interaction effects of the variables. RW
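
The comparison above can be sketched in code. This is only a minimal illustration under stated assumptions: the outcome z, the simulated data, the seed, and the helper names (dummies, rss) are all hypothetical and not part of the original question. It builds both codings and fits them by ordinary least squares. Approach 1 with all four interaction terms and approach 2 with eight cell dummies span the same column space, so they give the same fit; the difference is that approach 1 lets you drop the interaction block and test it as a nested model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two 3-level factors x and y, and an outcome z
# that (by construction) depends on y only.
n = 300
x = rng.integers(0, 3, size=n)
y = rng.integers(0, 3, size=n)
z = 1.0 + 0.5 * (y == 1) + 1.5 * (y == 2) + rng.normal(size=n)

def dummies(v, levels=(1, 2)):
    """0/1 indicator columns for the non-reference levels of v."""
    return np.column_stack([(v == k).astype(float) for k in levels])

ones = np.ones((n, 1))
dx, dy = dummies(x), dummies(y)

# Approach 1: intercept + 2 dummies for x + 2 dummies for y ...
X_main = np.hstack([ones, dx, dy])
# ... optionally plus the 4 products of x-dummies and y-dummies.
inter = np.column_stack([dx[:, i] * dy[:, j] for i in range(2) for j in range(2)])
X_full = np.hstack([X_main, inter])

# Approach 2: intercept + one dummy per cell, reference cell (0, 0) omitted.
cells = [(i, j) for i in range(3) for j in range(3) if (i, j) != (0, 0)]
X_cell = np.hstack(
    [ones] + [((x == i) & (y == j)).astype(float)[:, None] for i, j in cells]
)

def rss(X):
    """Residual sum of squares of the least-squares fit of z on X."""
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    return float(resid @ resid)

rss_main, rss_full, rss_cell = rss(X_main), rss(X_full), rss(X_cell)

# Nested F statistic for the 4 interaction terms (approach 1 only).
f_inter = ((rss_main - rss_full) / 4) / (rss_full / (n - X_full.shape[1]))

print("same fit for full approach 1 and approach 2:", np.isclose(rss_full, rss_cell))
print("F statistic for the interaction block:", f_inter)
```

With approach 2 alone, a comparable test of "no interaction" or "no effect of x" requires imposing linear restrictions across several cell coefficients, which is why the separated coding is usually easier to interpret.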