One specific comment and then some general comments. Note, these are just my
opinions on matters over which reasonable researchers might differ, so they're
offered in the spirit of opening rather than shutting down discussion. Having
said that, let me note further that my comments are critical of the approach
described by Buzz Burhans.

Specifically, in response to the question of whether it is appropriate to use a
correlation matrix with dummy variables: don't. Personally, I think the common
practice of examining correlation matrices to solve collinearity problems,
regardless of how variables are measured, is ill-advised. If the collinearity
were simple enough to be revealed by a correlation matrix, then it wouldn't take
a correlation matrix to find it. More to the point, problems of collinearity
are often so complex that a correlation matrix will obscure as much as it
reveals.
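To make the point about correlation matrices concrete, here is a minimal
sketch in Python (not Stata, and not taken from the original analysis). It
assumes made-up data: one categorical variable with five equally sized levels,
coded as a full set of five indicators. No pairwise correlation comes anywhere
near a cutoff such as 0.35, yet the indicators are exactly collinear with the
constant term.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Made-up data (an assumption for illustration): a five-level categorical
# variable with four observations per level.
levels = [k for k in range(5) for _ in range(4)]

# One indicator per level, deliberately keeping all five.
dummies = [[1 if lev == k else 0 for lev in levels] for k in range(5)]

# Largest absolute pairwise correlation among the indicators.
pairwise = [pearson(dummies[i], dummies[j])
            for i, j in combinations(range(5), 2)]
max_abs_r = max(abs(r) for r in pairwise)

# Exact linear dependence: in every row the five indicators sum to 1,
# which is the constant term of the model.
row_sums = [sum(col[i] for col in dummies) for i in range(len(levels))]
exactly_collinear = all(s == 1 for s in row_sums)

print(f"largest |r| among indicators: {max_abs_r:.2f}")   # 0.25
print(f"indicators sum to the constant: {exactly_collinear}")  # True
```

The dependence here involves six columns at once (five indicators plus the
constant), which is exactly the kind of relationship a table of pairwise
correlations cannot display.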
Now, more generally, the problem of collinearity is one of estimation, so it
would be nice if a few tools and rules of thumb could get us around it.
Unfortunately, this is not the case. Just as we shouldn't be tempted to use
stepwise methods to formulate regression models, I don't think we should rely on
automated processes for diagnosing and solving problems of collinearity. Buzz
Burhans has indicated that "theoretical plausibility" is one of the criteria he
used. Aside from the estimated coefficients and standard errors (or CIs), which
alert us to the existence of the problem, I submit this is the only criterion
that should be used. (I assume that whatever procedure is followed, when dummy
variables are involved they are excluded in whole sets corresponding to the
original variables and not discarded willy-nilly.)
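As a small illustration of excluding indicators in whole sets, the sketch
below (Python, with entirely hypothetical variable names) keeps a mapping from
each original categorical variable to the indicators generated from it, so
that a decision to drop a variable removes every one of its indicators
together.

```python
# Hypothetical mapping from each original categorical variable to the
# indicator columns created from it. All names are invented for illustration.
DUMMY_SETS = {
    "region": ["region_2", "region_3", "region_4"],
    "season": ["season_2", "season_3", "season_4"],
    "grade": ["grade_2", "grade_3"],
}


def drop_variable(columns, dummy_sets, variable):
    """Return `columns` with every indicator belonging to `variable` removed."""
    to_drop = set(dummy_sets[variable])
    return [c for c in columns if c not in to_drop]


# A continuous regressor plus all indicator columns.
columns = ["age"] + [c for cols in DUMMY_SETS.values() for c in cols]

# Dropping "season" removes season_2, season_3 and season_4 as a unit.
kept = drop_variable(columns, DUMMY_SETS, "season")
print(kept)
```

Working from the original variable name, rather than from individual
indicator columns, makes it impossible to leave a partial set behind.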
Problems with collinearity should be readily apparent from the behavior of the
estimated standard errors and/or coefficients. Short of collecting more data,
which is often the best solution, solving the problem is much more difficult
than identifying it. I say it's difficult because I assume every variable
included has a theoretical reason for being included and the researcher is
rarely justified in discarding relevant variables. However, when collinearity
is severe enough that we can't estimate a model, then we have to make some
compromises. This is where I believe it is the researcher's responsibility to
reconsider or rethink the theory that led to the model. Are the variables
included truly distinct factors or is there redundancy when variables are
combined?

As an aside, which means I'm not necessarily talking about Buzz Burhans'
situation, it's been my experience that far too many "problems" are blamed on
collinearity. A parameter estimate with a large variance is not by itself a
symptom of collinearity, for example. More often than not, it indicates an
irrelevant variable has been included in the analysis -- a theoretical problem
rather than a collinearity problem. In general, misspecification errors are far
more common than collinearity problems and should be ruled out before suspecting
collinearity.
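One way to see that a large variance need not come from collinearity is the
textbook expression for the standard error of a single OLS slope,
sigma / sqrt(SST_j * (1 - R_j^2)), where SST_j is the total sum of squares of
regressor j and R_j^2 comes from regressing it on the other regressors. The
sketch below is an illustration under stated assumptions, not the analysis
discussed above: it uses OLS rather than logistic regression, invented values
for n and sigma, and a sparse indicator as just one example of a
non-collinearity cause.

```python
from math import sqrt


def ols_slope_se(sigma, sst, r2_others):
    """SE of one OLS slope: sigma / sqrt(SST_j * (1 - R_j^2))."""
    return sigma / sqrt(sst * (1.0 - r2_others))


def indicator_sst(n, p):
    """Total sum of squares of a 0/1 indicator with proportion p of ones."""
    return n * p * (1.0 - p)


# Invented values, for illustration only.
n, sigma = 100, 1.0

# Balanced indicator, unrelated to the other regressors (R_j^2 = 0).
baseline = ols_slope_se(sigma, indicator_sst(n, 0.5), r2_others=0.0)

# Same indicator, but 90% of its variation is explained by other regressors.
collinear = ols_slope_se(sigma, indicator_sst(n, 0.5), r2_others=0.9)

# Rare indicator (2% ones), still unrelated to the other regressors.
rare = ols_slope_se(sigma, indicator_sst(n, 0.02), r2_others=0.0)

print(f"baseline:  {baseline:.3f}")
print(f"collinear: {collinear:.3f}")
print(f"rare:      {rare:.3f}")
```

In this made-up setting the rare indicator has a larger standard error than
the collinear one even though its R_j^2 is zero, so a wide interval alone does
not establish that collinearity is the cause.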
Dave Moore
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Buzz Burhans
> Sent: Thursday, June 12, 2003 11:07 AM
> To: [email protected]
> Subject: st: collinear categorical variable identification
>
>
> Dear Stata listers, especially epidemiologists,
>
> I have a question related to the identification and removal of collinear
> categorical variables. My question(s) are about the use of coldiag or other
> methods to identify collinear dichotomous variables for logistic regression.
>
> I have replaced the nominal and ordinal independent variables with
> dichotomous indicator variables. The dataset contains a fairly large
> number of factors which are collinear independent variables, and I am
> uncertain of the best way to identify and eliminate collinearity in the
> case of categorical variables.
>
> I have used coldiag, with a cut off of 30, accompanied by theoretical
> plausibility, to identify candidates for removal from independent variables
> due to collinearity. However, I find that there is still some instability
> in the model, indicated by large CIs for the odds ratios. I then looked at
> the correlation matrix for the regressors, and using a combination of
> identification by a lower singular value (10), and by what is suggested by
> the independent variable correlations, I identify candidates for further
> removal. In making the identification I consider the correlation of two
> independent variables (if > 0.35, strong consideration for removal), the
> contribution to variance decomposition (> 0.5 suggests removal), the
> strength of the correlation with the dependent variable (when there are
> competing candidates, the more strongly correlated of the two is
> retained), and theoretical plausibility.
>
> My understanding is that using the correlation matrix when the regressor
> matrix includes dichotomous variables is not appropriate. However, the
> models are stable and sensible, and improved (more stable) from my earlier
> runs when I used simply the coldiag and a higher condition number as a
> cutoff. When I go back and tabulate the competing variables in two-way
> tables, my decisions seem to be reasonable. When they are two dichotomous
> variables, the odds ratios seem to support the decisions, and visual
> inspection of the tables for categorical variables with several categories
> seems consistent with the decisions for retention or exclusion I made.
> (The dataset is relatively small, and empty cells are not uncommon in the
> two-way tabulations.)
>
>
> Can you comment on my strategy, in particular on the appropriateness of
> coldiag approach in this case, and on the appropriateness of using a
> correlation assessment for categorical variables? Can you suggest a better
> strategy?
>
> Thanks very much for any help you can offer.
>
>
>
> Buzz Burhans
> [email protected]
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
>