Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Use of collapse (sum) in Multiple Imputation
From: [email protected] (Yulia Marchenko, StataCorp LP)
To: [email protected]
Subject: Re: st: Use of collapse (sum) in Multiple Imputation
Date: Wed, 12 Oct 2011 16:37:30 -0500
Alberto Zezza <[email protected]> asks how to obtain household-level data
which can be analyzed using -mi- from multiply-imputed individual-level data:
> I have a dataset with both individual and family level variables.
> Individuals are uniquely identified by a variable pid, households by a
> variable hhid.
>
> I have missing data for some individuals in an individual level variable x
> which I would like to impute, before summing it up over individuals within a
> household to obtain a household level variable to use in further analysis.
>
> Is there a way to do that and carry on the analysis within the mi
> environment in the household level file?
Alberto then provides code where he uses -collapse- with -mi xeq- to obtain
such a dataset, but receives an error:
> I am currently doing the following, using an individual level data file:
>
> mi set wide
> mi register imputed x
> mi register regular y z
> mi impute regress x y z, add(20)
> mi xeq: sort hhid; collapse (sum) x _*, by (hhid)
>
> but the command stops with an error when performing the collapse for m=M (20
> in my case) saying
>
> variable _mi_id does not uniquely identify observations in the master data
> r(459);
>
> The variable _mi_id is not 'visible' in my list of variables so I presume this
> is something Stata generates in the background to manage mi data.
Alberto receives this error because the -collapse- command should not be
used with -mi xeq-. Like -append-, -merge-, -reshape-, etc., -collapse-
substantially modifies the current data and thus should not be allowed
with -mi xeq-. We will modify -mi xeq- to issue an appropriate error message
when -collapse- is used.
Unlike such commands as -append- and -merge-, the -collapse- command does not
have an -mi- analog, e.g. -mi collapse-. However, we can do what Alberto
wants manually.
Before I proceed with an example, let's first agree on the definition of an
incomplete observation in the aggregated (household-level) data. The
distinction between complete and incomplete observations is important for the
-mi- command. So, we will consider an aggregate observation to be incomplete
if there is at least one missing observation among the individual observations
used to obtain the aggregate observation.
Using Alberto's example, we can obtain household-level data as follows. After
the imputation step, we perform:
// create household-level sums
. mi convert flong, clear
. qui mi xeq: by hhid, sort: egen x_sum = total(x)
// create incomplete observations in the household-level variable x_sum
. qui mi xeq 0: gen Mis_x = (x==.)
. qui mi xeq 0: by hhid, sort: egen Mis_total = total(Mis_x)
. qui mi xeq 0: replace x_sum = . if Mis_total>0
// create household-level data
. qui mi xeq: sort hhid pid; by hhid: drop if _n>1
. qui mi xeq: drop pid x /*include any other individual-level variables*/
// mark incomplete household-level observations
. mi register imputed x_sum
Below I provide a detailed discussion of the code above.
First, it is important to note that many group-specific summaries of imputed
variables, such as the household-level sums of x in our example, are so-called
super-varying variables in individual-level datasets. Super-varying variables
are variables which may vary between imputations not only in the incomplete
observations but also in the complete observations; see -help mi glossary- for
more information. Super-varying variables can exist only in the -flong- (or
-flongsep-) style, so we should either start with this style or use -mi
convert- to convert to it before we create variables containing group-specific
summaries. To create household-specific sums of x, we can use -mi xeq: egen-.
So, we start by converting from the previously set -wide- style to -flong-:
. mi convert flong, clear
and then create a new variable x_sum containing household-specific sums of x:
. qui mi xeq: by hhid, sort: egen x_sum = total(x)
Because x_sum is a super-varying variable, it should not be registered in the
individual-level data.
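One way to check how -mi- classifies the new variable is sketched below. It assumes the data are already in the -flong- style and that x_sum has been created as shown above; no particular output is claimed here.

```stata
// report variables that vary, or super-vary, across imputations
mi varying

// summarize the mi style, the number of imputations, and registered variables
mi describe
```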
Alberto will need to manually create new household-level variables for any
other individual-level variables of interest, which can be done in a loop.
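A minimal sketch of such a loop follows. The variable names w1 and w2 are hypothetical placeholders for other imputed individual-level variables; they do not appear in Alberto's data.

```stata
// w1 and w2 are hypothetical individual-level variables
foreach v of varlist w1 w2 {
    // household-level sum of each variable, computed within each imputation
    quietly mi xeq: by hhid, sort: egen `v'_sum = total(`v')
}
```

Each new sum would then need the same treatment as x_sum in the remaining steps: setting it to missing in m=0 for incomplete households and registering it as imputed in the household-level data.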
Next, in the original data (m=0), we set x_sum to missing for every household
in which at least one individual has a missing value of x:
. qui mi xeq 0: gen Mis_x = (x==.)
. qui mi xeq 0: by hhid, sort: egen Mis_total = total(Mis_x)
. qui mi xeq 0: replace x_sum = . if Mis_total>0
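This step is needed because the -egen- function total() treats missing values as zeros, so in m=0 a household with a missing x would otherwise receive a nonmissing sum. The toy example below illustrates this in plain Stata, outside of -mi-; the data values are made up for illustration.

```stata
// toy data, hypothetical values: household 1 has one missing x
clear
input hhid pid x
1 1 10
1 2 .
2 1 5
2 2 7
end

// total() skips the missing value, so household 1 gets a sum of 10
by hhid, sort: egen x_sum = total(x)

// flag households with any missing x and set their sums to missing
gen Mis_x = (x==.)
by hhid, sort: egen Mis_total = total(Mis_x)
replace x_sum = . if Mis_total>0

list, sepby(hhid)
```

After the -replace-, x_sum is missing for household 1 and 12 for household 2.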
Once all aggregate variables are created, we can drop individual-level
observations except the first observation:
. qui mi xeq: sort hhid pid; by hhid: drop if _n>1
We can now drop all individual-level variables:
. qui mi xeq: drop pid x /*include any other individual-level variables*/
Finally, we register x_sum as imputed to mark incomplete household-level
observations:
. mi register imputed x_sum
The resulting dataset now corresponds to -mi- household-level data.
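As a quick check, and only as a sketch, one might describe the resulting data and run a simple analysis on it. Estimating the mean of x_sum is chosen here purely for illustration; Alberto's actual household-level model would take its place.

```stata
// confirm the style, the number of imputations, and that x_sum is registered as imputed
mi describe

// combine estimates of the mean household-level sum across imputations
mi estimate: mean x_sum
```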
As a side note, Alberto should consider taking into account the clustered
nature of his data during imputation of x. The following FAQ provides some
guidelines:
http://www.stata.com/support/faqs/stat/impute_cluster.html
If Alberto has any questions, he should contact [email protected] and
they will be happy to help him further.
-- Yulia
[email protected]