--- "Daniel Waxman" <dan@a...> wrote:
> I need to do a relatively simple imputation, but am having trouble
> following the examples given.
> Here is the situation:
>
> Dataset ~ 10,000 obs (non-weighted, 1 obs/subject)
>
> Variable to be imputed:
> EKG_abnormal --binary(yes/no), missing at random < 5% of
> observations.
>
> Potential predictors with which to impute:
> At least five, some binary (e.g. chestpain yes/no, first_cat (1-5),
> etc.)
> some which are continuous but can be made categorical (e.g. age ==>
> age_cat)
>
> Primary outcome being studied: Death yes/no
>
> The questions:
> (1) Should I use the outcome variable (death) as one of imputation
> variables? Should I use many imputation variables since I can
> (large dataset)?
>
> (2) Most important: Can somebody give an example for the correct
> way to issue the commands?
>
> If I do the following:
>
> . hotdeck ekg_abnormal using imp, by(agecat first_cat) store
> keep(merge_variable) impute(5)
>
> Then I end up with 5 files, imp1 imp2 imp3 imp4 imp5
> Eventually I want to end up with imputed values for ekg_abnormal
> that I can use in the main logistic regression equation of interest.
> Not sure where the options infile(), command(logit) fit into things.
The two questions are related. -hotdeck- produces multiple files
because it implements the multiple imputation variant of hotdeck
imputation, and with multiple imputation you should also include
your dependent variable in the imputation. If you leave it out, the
missing values are imputed as if there were no relation between
ekg_abnormal and death, so the relationship between these two
variables estimated from the imputed datasets will be biased towards
zero.
Adding more variables to the imputation makes the MAR assumption more
plausible, but increases the probability that some of the cells are
very sparse. Empty or nearly empty cells should be avoided in hotdeck
imputation. So you should add variables that are strongly related
to the imputed variable, and you should add as many as possible
without creating sparse cells.
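To make the sparse-cell check concrete, here is a minimal sketch in Python (not Stata, and not part of -hotdeck-) that counts observed donors and missing values per cell. The toy records, the helper names and the min_donors threshold are all made up for illustration; what counts as "too sparse" is a judgment call.

```python
from collections import defaultdict

def donor_counts(rows, by, target):
    """Count observed donors and missing values of `target` in each cell
    defined by the variables in `by`."""
    cells = defaultdict(lambda: {"donors": 0, "missing": 0})
    for row in rows:
        key = tuple(row[v] for v in by)
        if row[target] is None:
            cells[key]["missing"] += 1
        else:
            cells[key]["donors"] += 1
    return {k: dict(c) for k, c in cells.items()}

def sparse_cells(rows, by, target, min_donors=5):
    """Return cells that need imputation but have fewer than `min_donors`
    observed values to draw from (min_donors is an arbitrary threshold)."""
    return {
        k: c
        for k, c in donor_counts(rows, by, target).items()
        if c["missing"] > 0 and c["donors"] < min_donors
    }

# Hypothetical toy data; None marks a missing ekg_abnormal.
rows = [
    {"agecat": 1, "death": 0, "ekg_abnormal": 1},
    {"agecat": 1, "death": 0, "ekg_abnormal": 0},
    {"agecat": 1, "death": 0, "ekg_abnormal": None},
    {"agecat": 2, "death": 1, "ekg_abnormal": 1},
    {"agecat": 2, "death": 1, "ekg_abnormal": None},
]

flagged = sparse_cells(rows, ["agecat", "death"], "ekg_abnormal", min_donors=2)
print(flagged)
```

If adding a variable to by() makes many cells show up in this kind of list, that is a sign you have gone one variable too far.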
The idea behind Multiple Imputation is as follows: If you impute just
once, you assume that you are as sure about the imputed values as you
are about the observed values. So, if you impute once, you
underestimate the standard error, i.e. you think you are more sure
about the parameter than you really are. However, the observed cases
in each cell also give information about the distribution of likely
values of the missing observations (under the MAR assumption). For
each missing value you can draw at random a number of values, e.g. 5,
from this distribution, and thus create 5 completed datasets. These
are the completed datasets you got from the -hotdeck- command. You
can now estimate the model of interest on each completed dataset.
The variation in estimates between completed datasets is a measure of
the added uncertainty due to using imputed values. The procedure used
by -hotdeck- is described in Rubin (1987, pp. 122-124) and Allison
(2002, pp. 57-58).
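As a rough illustration of drawing several values per missing observation, here is a small Python sketch of a within-cell hotdeck. It assumes the procedure in those references is the approximate Bayesian bootstrap (first resample the donors in a cell with replacement, then draw the imputed values from that resampled pool). It is not a reimplementation of -hotdeck-, and the function names and toy data are hypothetical.

```python
import random
from collections import defaultdict

def hotdeck_once(rows, by, target, rng):
    """Return one completed copy of `rows`, filling missing `target` values
    with draws from observed values in the same cell."""
    donors = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            donors[tuple(row[v] for v in by)].append(row[target])
    # Step 1: resample each cell's donors with replacement (same pool size).
    pools = {k: [rng.choice(v) for _ in v] for k, v in donors.items()}
    completed = []
    for row in rows:
        new = dict(row)
        if new[target] is None:
            key = tuple(row[v] for v in by)
            if key not in pools:
                raise ValueError(f"no donors in cell {key}")
            # Step 2: draw the imputed value from the resampled pool.
            new[target] = rng.choice(pools[key])
        completed.append(new)
    return completed

def multiply_impute(rows, by, target, m=5, seed=12345):
    """Create m completed datasets; the seed only makes the draws repeatable."""
    rng = random.Random(seed)
    return [hotdeck_once(rows, by, target, rng) for _ in range(m)]

# Hypothetical toy data; None marks a missing ekg_abnormal.
rows = [
    {"agecat": 1, "death": 0, "ekg_abnormal": 1},
    {"agecat": 1, "death": 0, "ekg_abnormal": 1},
    {"agecat": 1, "death": 0, "ekg_abnormal": None},
    {"agecat": 2, "death": 1, "ekg_abnormal": 0},
    {"agecat": 2, "death": 1, "ekg_abnormal": None},
]

datasets = multiply_impute(rows, ["agecat", "death"], "ekg_abnormal", m=5)
print(len(datasets))
```

Each element of datasets plays the role of one of the files imp1 ... imp5: the observed values are identical across them, and only the imputed values can differ.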
The command you could use is (in a do-file, with /// continuing the
line):
hotdeck ekg_abnormal, by(chestpain first_cat agecat death) ///
    command(logit death chestpain age first_cat ekg_abnormal) ///
    parms(chestpain age first_cat ekg_abnormal _cons) impute(5)
This generates the completed datasets, estimates the model of
interest (the model specified in the command() option) on each of
them, and combines the results (for the parameters listed in the
parms() option) for you.
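For what "combines the results" amounts to, here is a short Python sketch of Rubin's rules for one parameter. I am assuming -hotdeck- combines along these lines; the function name and the five coefficients and standard errors below are invented purely for illustration.

```python
import math

def combine_rubin(estimates, variances):
    """Combine one parameter's estimates and squared standard errors
    from m completed datasets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 estimates, each with a variance")
    qbar = sum(estimates) / m                      # combined point estimate
    within = sum(variances) / m                    # average sampling variance
    between = sum((q - qbar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between         # total variance
    return qbar, math.sqrt(total)

# Hypothetical logit coefficients of ekg_abnormal from 5 completed datasets,
# with their standard errors.
coefs = [0.82, 0.79, 0.85, 0.80, 0.84]
ses = [0.10, 0.11, 0.10, 0.10, 0.11]

estimate, se = combine_rubin(coefs, [s ** 2 for s in ses])
print(round(estimate, 3), round(se, 3))
```

The between-imputation term is what a single imputation leaves out, which is why the combined standard error is never smaller than the average of the per-dataset ones.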
Hope this helps,
Maarten
Donald Rubin (1987) "Multiple Imputation for Nonresponse in Surveys",
New York: Wiley.
Paul Allison (2002) "Missing Data", Thousand Oaks: Sage.