Thanks for this thoughtful reply. My problem is a little different. In
my problem, I have some continuous (maybe 'normal') variables, some
dichotomous variables, and some categorical variables. It looks like mi
impute will allow me to impute the normal variables and all others, but
when I want to impute the categorical variables it looks as if I will
re-impute the normal ones as categories. I will likely need to continue
to use ICE.
BTW, I've just finished a study on variable selection with missing
values. I imputed using ICE and then ran a stepwise procedure. It
worked very well, and no matter which selection method I used, almost
the same variables were selected. I used lars, stepwise regression, and
stepwise ordered logistic regression.
The manuscript is in the final revision process, so it is not available
for distribution.
Tony
Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Yulia
Marchenko, StataCorp LP
Sent: Monday, July 27, 2009 2:32 PM
To: [email protected]
Subject: Re: st: Stata 11 imputation
Fred Wolfe <[email protected]> asks about imputing
multiple categorical variables using -mi impute mvn- available
as of Stata 11:
> I wonder if it might be possible in a revision of the manual to
> actually describe how to impute categorical values without having to
> purchase Allison's book (available on Amazon.com at a reasonable
> cost). There are a lot of "simple" examples in the manual, but no
> complex examples - something that would be helpful.
Before I answer Fred's specific questions, let me note that imputing
multiple categorical variables is a difficult task in general.
Currently, there is no definitive recommendation in the literature as
to which imputation method should be used to perform this task.
Multivariate normal imputation is not designed for imputing multiple
categorical variables. However, Allison (2001, 40) suggests an ad hoc
way to do this. One can use a dummy representation of the categorical
variables and impute the corresponding indicator variables. For
example, if a variable contains three categories, one imputes two
indicator variables, corresponding to two of the categories, and then
computes the third indicator variable, corresponding to the reference
category, as one minus the sum of the two imputed indicator variables.
The imputed indicator variables will contain values on a continuous
scale. To convert them to the binary metric, you assign 1 to the
indicator variable with the largest value and 0 to the other indicator
variables. More simulation is needed to evaluate the performance of
this method in practice.
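
The rounding step just described can be sketched as follows. This is
only an illustration in Python, not Stata code and not part of -mi
impute mvn-; the function names and the imputed values are hypothetical,
and ties are broken in favor of the first category, which is an
assumption of this sketch.

```python
def indicators_to_category(imputed_dummies):
    """Pick a category from continuous imputed indicator values.

    imputed_dummies holds the imputed values of the K-1 non-reference
    indicators. Returns an index 0..K-2 for those categories, or K-1
    for the reference category.
    """
    # Reference indicator = one minus the sum of the imputed indicators.
    reference = 1.0 - sum(imputed_dummies)
    full = list(imputed_dummies) + [reference]
    # Assign the category whose indicator has the largest value
    # (max returns the first index on ties; an assumption of this sketch).
    return max(range(len(full)), key=lambda k: full[k])


def category_to_indicators(category, n_categories):
    """Return the 0/1 indicator vector for the chosen category."""
    return [1 if k == category else 0 for k in range(n_categories)]


# Hypothetical imputed values for a three-category variable.
chosen = indicators_to_category([0.62, 0.15])  # reference is about 0.23
rounded = category_to_indicators(chosen, 3)    # [1, 0, 0]
```

Note that imputed values outside [0, 1] are possible under a normal
model; the rule above still applies to them unchanged.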
Allison (2001) also notes that analysis using the imputed values without
rounding is superior to analysis using the rounded imputed values (as
described above). Our simulations displayed similar behavior in the
case of binary predictors.
However, if a binary or categorical _dependent_ variable is being
imputed using a regression-based method, rounding is unavoidable.
> Would it be possible for StataCorp people to indicate on the list the
> advantages of their multivariate method compared with Royston's.
-mi impute mvn- implements a method for imputing multivariate continuous
data based on Schafer (1997), which is an extension of the theoretical
work by Li (1988). This method is commonly referred to as NORM. NORM
assumes a joint multivariate normal distribution and uses data
augmentation (an iterative MCMC procedure) to simulate a predictive
distribution from which imputed values are drawn.
Patrick Royston's -ice- command implements imputation via chained
equations (ICE). ICE uses Gibbs sampling, another MCMC procedure, to
obtain imputed values. ICE, however, does not assume a joint
multivariate model. Instead, it uses a set of univariate full
conditional specifications. In general, these conditional models are
not guaranteed to correspond to a proper joint multivariate
distribution.
The main advantage of NORM is a theoretical one -- the convergence of
the method to a proper posterior distribution is theoretically
justified. Theoretical justification for the chained-equation approach
in general is not as well developed in the literature, mainly because
the chained-equation approach is not always supported by a proper
underlying multivariate model; see, for example, van Buuren (2007).
The main advantage of ICE is that it is more flexible than NORM and can
more directly handle non-continuous data. However, as mentioned above,
convergence to a proper multivariate distribution can be an issue.
Under the assumption of normality, ICE corresponds to a pure Gibbs
sampling procedure and is equivalent to NORM. The two procedures
performed comparably in our simulation. More simulation is needed,
however, to compare the two methods for imputing binary or categorical
data.
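
As an illustration of the normal case, consider just two jointly normal
variables. The full conditionals below are the standard bivariate normal
result, written in LaTeX notation; the symbols (means \mu_j, variances
\sigma_{jj}, covariance \sigma_{12}) are introduced here only for this
sketch, and the parameters are treated as known.

```latex
\begin{aligned}
X_1 \mid X_2 = x_2 &\sim N\!\left(\mu_1 + \frac{\sigma_{12}}{\sigma_{22}}\,(x_2 - \mu_2),\;
                                  \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}}\right), \\
X_2 \mid X_1 = x_1 &\sim N\!\left(\mu_2 + \frac{\sigma_{12}}{\sigma_{11}}\,(x_1 - \mu_1),\;
                                  \sigma_{22} - \frac{\sigma_{12}^2}{\sigma_{11}}\right).
\end{aligned}
```

Each full conditional is a linear regression with normal errors, so a
chain of univariate linear regressions draws from the conditionals of a
single joint normal model. With logistic or other non-normal conditional
models, no such joint distribution is guaranteed to exist.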
References:
Allison, P. D. 2001. Missing Data. Thousand Oaks, CA: Sage.
Li, K.-H. 1988. Imputation using Markov chains. Journal of Statistical
Computation and Simulation 30: 57--79.
Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca
Raton, FL: Chapman & Hall/CRC.
van Buuren, S. 2007. Multiple imputation of discrete and continuous data
by fully conditional specification. Statistical Methods in Medical
Research 16: 219--242.
-- Yulia
[email protected]
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/