Title: Accounting for clustering with mi impute
Authors: Wesley Eddings and Yulia Marchenko, StataCorp
Note 1: This frequently asked question (FAQ) assumes familiarity with multiple imputation. Please see the documentation entries [MI] intro substantive and [MI] intro if you are unfamiliar with the method. Also, if your data have already been imputed, see the documentation entry [MI] mi import on how to import your data to mi and see [MI] mi estimate on how to analyze your multiply imputed data.
Note 2: Because the mi impute command is based on random draws, results may differ from those produced by previous versions of Stata because of the 64-bit Mersenne Twister pseudorandom-number generator, which was introduced in Stata 14.
As of Stata 11.1, the mi estimate command can be used to analyze multiply imputed clustered (panel or longitudinal) data by fitting several clustered-data models, such as xtreg, xtlogit, and mixed; see mi estimation for the full list. However, we must also account for clustering when creating multiply imputed data; this FAQ will show how.
We can create multiply imputed data with mi impute, Stata’s official command for imputing missing values. There is no definitive recommendation in the literature on the best way to impute clustered data, but three strategies have been suggested:

1. Include indicator variables for the clusters in the imputation model.
2. Impute each cluster separately.
3. Impute jointly over clusters with a multivariate normal model, treating the observations within each cluster as one multivariate observation.
We will explain how to carry out each strategy with mi impute. We will assume for now that we have data in long form and that only one variable has missing values; extensions to more than one imputed variable will be described later.
If there are not too many clusters, we can account for clustering by including cluster indicators in our imputation model. The factor-variable syntax of Stata makes it easy to include the indicators with mi impute: we do not even have to generate any new variables.
Our first example dataset, data1.dta, has 40 observations within each of 10 clusters; the variable id indexes observations within clusters. Ten percent of the observations have missing values for the observation-level predictor x; no values of the response y are missing. We want to study the association between y and the partially observed predictor x while accounting for the association within clusters.
. use http://www.stata.com/support/faqs/data1

. describe

Contains data from http://www.stata.com/support/faqs/data1.dta
 Observations:           400
    Variables:             4                   29 Jul 2010 14:56
-------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------
cluster         float   %9.0g
id              float   %9.0g
y              double   %10.0g
x              double   %10.0g
-------------------------------------------------------------------------
. by cluster: summarize y x

-------------------------------------------------------------------------
-> cluster = 1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+----------------------------------------------------------
           y |         40    97.49968    33.62784   23.14309   160.8217
           x |         38    30.00173    7.943642   12.95944   42.72091

-------------------------------------------------------------------------
-> cluster = 2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+----------------------------------------------------------
           y |         40    100.2756    31.70555   20.78498   151.5145
           x |         39    30.77486    8.020621   5.549631   44.48839

-------------------------------------------------------------------------
-> cluster = 3

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+----------------------------------------------------------
           y |         40    147.5954    37.44895   71.18038    217.404
           x |         38    31.85539    8.794805   16.45632   49.47706

  (output omitted)
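Before imputing, we could also confirm which variables actually contain missing values. One quick way (our addition, not part of the original session) is misstable summarize:

. misstable summarize y x

This would report that y is complete and that 40 of the 400 values of x are missing, consistent with the by-cluster summaries above.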
We impute the missing values of x with mi impute regress, Stata’s linear regression imputation method, which assumes a Gaussian (normal) error distribution. We account for clustering by including the factor variable i.cluster in our imputation model. The response y should also be included as a predictor:
. mi set wide

. mi register imputed x

. mi impute regress x y i.cluster, add(5) noisily rseed(123)

Running regress on observed data:
      Source |       SS           df       MS      Number of obs   =       360
-------------+----------------------------------   F(10, 349)      =     32.74
       Model |  11088.9434        10  1108.89434   Prob > F        =    0.0000
    Residual |   11821.207       349  33.8716533   R-squared       =    0.4840
-------------+----------------------------------   Adj R-squared   =    0.4692
       Total |  22910.1504       359  63.8165749   Root MSE        =    5.8199

------------------------------------------------------------------------------
           x | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           y |   .1572187   .0088668    17.73   0.000     .1397797    .1746578
             |
     cluster |
          2  |   .5249299   1.326672     0.40   0.693    -2.084348    3.134208
          3  |   -6.21639   1.410625    -4.41   0.000    -8.990786   -3.441994
          4  |  -1.153281   1.364677    -0.85   0.399    -3.837306    1.530743
          5  |   .6848743   1.387169     0.49   0.622    -2.043388    3.413136
          6  |   -4.79826   1.409348    -3.40   0.001    -7.570143   -2.026376
          7  |  -1.828347    1.34203    -1.36   0.174     -4.46783    .8111363
          8  |  -1.427531   1.349231    -1.06   0.291    -4.081178    1.226117
          9  |   1.565089   1.353659     1.16   0.248    -1.097267    4.227444
         10  |  -2.067867   1.384883    -1.49   0.136    -4.791633    .6558993
             |
       _cons |   14.49285   1.287011    11.26   0.000     11.96157    17.02412
------------------------------------------------------------------------------
Univariate imputation                       Imputations =        5
Linear regression                                 added =        5
Imputed: m=1 through m=5                        updated =        0

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
                 x |        360           40        40 |       400
------------------------------------------------------------------
We used the noisily option of mi impute to display the intermediate regression output, which confirms that indicator variables for nine of the ten clusters were included (the first cluster is the base level). We now fit our analysis model by using, for example, mixed with the mi estimate: prefix:
. mi estimate: mixed y x || cluster:

Multiple-imputation estimates                   Imputations       =          5
Mixed-effects ML regression                     Number of obs     =        400
Group variable: cluster                         Number of groups  =         10
                                                Obs per group:
                                                              min =         40
                                                              avg =       40.0
                                                              max =         40
                                                Average RVI       =     0.0723
                                                Largest FMI       =     0.1366
DF adjustment:   Large sample                   DF:           min =     238.81
                                                              avg =  22,755.03
                                                              max =  89,559.86
Model F test:       Equal FMI                   F(   1,  292.0)   =     285.80
                                                Prob > F          =     0.0000
------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |    2.92631   .1730981    16.91   0.000     2.585632    3.266988
       _cons |   22.40641    7.02119     3.19   0.001     8.627185    36.18563
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
cluster: Identity            |
                   sd(_cons) |   14.02273   3.406021      8.711221    22.57284
-----------------------------+------------------------------------------------
                sd(Residual) |   25.46452   .9772013      23.61045    27.46419
------------------------------------------------------------------------------
The coefficient of x is estimated to be about 3 with a standard error of about 0.2, and the cluster-level intercepts have a mean of about 22 with a standard deviation of about 14. Had we not included the cluster variable in our imputation model, we would have obtained a smaller estimate of the variance component for clusters.
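To see that attenuation for yourself, you could rerun the imputation without the cluster indicators and refit the same analysis model. A minimal sketch, assuming you start over from a fresh copy of data1 (our addition; the FAQ does not run this comparison):

. use http://www.stata.com/support/faqs/data1, clear
. mi set wide
. mi register imputed x
. mi impute regress x y, add(5) rseed(123)
. mi estimate: mixed y x || cluster:

Because the imputed values of x would then ignore cluster membership, between-cluster differences in x are smoothed away, and we would expect the estimate of sd(_cons) to shrink.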
Graham (2009) suggests that cluster indicators can work well for as many as 35 indicator variables. Strategy 1 is best suited for data with few clusters and many observations within each cluster.
By including clusters as indicator variables in our imputation model (strategy 1), we allow the regression function of the imputed variable to vary by cluster. More generally, we can allow the entire distribution of the imputed values to differ among clusters by imputing each cluster separately (Graham 2009). As of Stata 12, we can do this with mi impute's by() option.
Our second example dataset, data2.dta, like the first, includes a response variable with no missing values and a predictor x with 10% missing values. We have 50 observations within each of 20 clusters. We will impute each cluster separately and then fit an analysis model with mixed.
. use http://www.stata.com/support/faqs/data2.dta

. mi set wide

. mi register imputed x

. mi impute regress x y, add(5) by(cluster, noreport) rseed(123)

Univariate imputation                       Imputations =        5
Linear regression                                 added =        5
Imputed: m=1 through m=5                        updated =        0
-------------------------------------------------------------------
     by()          |               Observations per m
                   |-----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |      Total
-------------------+-----------------------------------+-----------
cluster = 1        |
                 x |         44            6         6 |         50
cluster = 2        |
                 x |         47            3         3 |         50
  (output omitted)
cluster = 19       |
                 x |         45            5         5 |         50
cluster = 20       |
                 x |         46            4         4 |         50
Overall            |
                 x |        900          100       100 |       1000
-------------------------------------------------------------------
We now run mi estimate: mixed on our multiply imputed data:
. mi estimate: mixed y x || cluster:

Multiple-imputation estimates                   Imputations       =          5
Mixed-effects ML regression                     Number of obs     =      1,000
Group variable: cluster                         Number of groups  =         20
                                                Obs per group:
                                                              min =         50
                                                              avg =       50.0
                                                              max =         50
                                                Average RVI       =     0.0546
                                                Largest FMI       =     0.1551
DF adjustment:   Large sample                   DF:           min =     187.43
                                                              avg = 156,793.24
                                                              max = 422,770.46
Model F test:       Equal FMI                   F(   1, 2471.3)   =    1577.92
                                                Prob > F          =     0.0000
------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   8.134255   .2047741    39.72   0.000     7.732709    8.535802
       _cons |   19.50823   6.343581     3.08   0.002      7.07497     31.9415
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
cluster: Identity            |
                   sd(_cons) |   26.52947   4.312226      19.29167    36.48273
-----------------------------+------------------------------------------------
                sd(Residual) |   30.48525   .7451697      29.05013    31.99127
------------------------------------------------------------------------------
The coefficient for x is about 8 with a standard error of about 0.2, and the intraclass correlation is about 27²/(27² + 30²) ≈ 0.45. The intraclass correlation ranges from zero to one, and larger values mean that the clustering variable is more informative.
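As a quick arithmetic check (our addition), display can evaluate that expression directly from the rounded standard deviations above:

. display 27^2/(27^2 + 30^2)
.44751381

Using the unrounded estimates, 26.52947 and 30.48525, gives about 0.43.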
Imputing each cluster separately requires a sufficient number of observations in each cluster.
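One simple diagnostic (ours, not from the FAQ) is to tabulate missing values of x against the cluster identifier before mi setting the data:

. generate byte xmiss = missing(x)
. tabulate cluster xmiss

Clusters with very few complete observations leave mi impute little information for estimating a separate imputation model, so the cluster-by-cluster regressions can fail or yield unstable imputations.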
A third way to account for within-cluster correlation is to impute jointly over clusters by using a multivariate normal model: the vector of observations within each cluster is treated as a single draw from a multivariate normal distribution with an unrestricted covariance structure. This strategy works well even when there are only a few observations in each cluster (Allison 2002), but it has a limitation: it is best suited to balanced repeated-measures data.
We will illustrate the multivariate normal strategy with a new balanced dataset. It has 50 clusters but only 5 observations within each cluster. (Such data might occur, for example, in a repeated-measures study of subjects’ test scores.) We would once again like to impute missing values of x and then fit a linear mixed-effects model with mixed.
Before we can fit the multivariate normal imputation model, we will need to reshape our data to wide form so that each cluster occupies a single row. The variable id indexes observations within clusters.
. use http://www.stata.com/support/faqs/data3

. reshape wide x y, i(cluster) j(id)
(j = 1 2 3 4 5)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations              250   ->   50
Number of variables                   4   ->   11
j variable (5 values)                id   ->   (dropped)
xij variables:
                                      x   ->   x1 x2 ... x5
                                      y   ->   y1 y2 ... y5
-----------------------------------------------------------------------------
We can now impute with mi impute mvn, and the multivariate normal regression model will allow interdependencies within clusters.
. mi set wide

. mi register imputed x1 x2 x3 x4 x5

. mi impute mvn x1 x2 x3 x4 x5 = y1 y2 y3 y4 y5, add(5) rseed(123)

Performing EM optimization:
  observed log likelihood = -296.02862 at iteration 16

Performing MCMC data augmentation ...

Multivariate imputation                     Imputations =        5
Multivariate normal regression                    added =        5
Imputed: m=1 through m=5                        updated =        0

Prior: uniform                               Iterations =      500
                                                burn-in =      100
                                                between =      100
------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
                x1 |         42            8         8 |        50
                x2 |         43            7         7 |        50
                x3 |         46            4         4 |        50
                x4 |         47            3         3 |        50
                x5 |         47            3         3 |        50
------------------------------------------------------------------
To use mi estimate: mixed, we need to reshape our data back to long form. With mi data, we need to use the mi reshape command to do this:
. mi reshape long x y, i(cluster) j(id)
reshaping m=0 data ...
(j = 1 2 3 4 5)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations               50   ->   250
Number of variables                  11   ->   4
j variable (5 values)                     ->   id
xij variables:
                          x1 x2 ... x5    ->   x
                          y1 y2 ... y5    ->   y
-----------------------------------------------------------------------------
We are now ready to use mi estimate: mixed:
. mi estimate: mixed y x || cluster:

Multiple-imputation estimates                   Imputations       =          5
Mixed-effects ML regression                     Number of obs     =        250
Group variable: cluster                         Number of groups  =         50
                                                Obs per group:
                                                              min =          5
                                                              avg =        5.0
                                                              max =          5
                                                Average RVI       =     0.0942
                                                Largest FMI       =     0.2699
DF adjustment:   Large sample                   DF:           min =      65.13
                                                              avg = 551,115.87
                                                              max = 2,201,656.39
Model F test:       Equal FMI                   F(   1,   65.1)   =      45.75
                                                Prob > F          =     0.0000
------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |   .8791031   .1299657     6.76   0.000     .6195535    1.138653
       _cons |   2.570404    2.13238     1.21   0.229    -1.625142     6.76595
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
-----------------------------+------------------------------------------------
cluster: Identity            |
                   sd(_cons) |   11.37668   1.183462      9.278313    13.94961
-----------------------------+------------------------------------------------
                sd(Residual) |   5.042148   .2573897      4.561861       5.573
------------------------------------------------------------------------------
All three strategies can be modified to impute more than one variable. The indicator-variable and separate-imputation strategies, strategies 1 and 2, require a multivariate imputation method such as mi impute monotone, mi impute chained, or mi impute mvn in place of a univariate method such as mi impute regress. The multivariate normal strategy, strategy 3, can be extended by adding extra variables to the left-hand side of the equation in mi impute mvn. If we wanted to impute x and another variable z, the commands might look like this:
. reshape wide x y z, i(cluster) j(id)
. mi set wide
. mi register imputed x1 x2 x3 x4 x5 z1 z2 z3 z4 z5
. mi impute mvn x1 x2 x3 x4 x5 z1 z2 z3 z4 z5 = y1 y2 y3 y4 y5, add(5)
. mi reshape long x y z, i(cluster) j(id)
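For strategies 1 and 2, the corresponding commands would stay in long form and swap in a multivariate imputation method. A sketch with mi impute chained, again treating x and z as the incomplete variables (our illustration, under the same assumptions):

. mi set wide
. mi register imputed x z
. mi impute chained (regress) x z = y i.cluster, add(5)

or, to impute each cluster separately instead,

. mi impute chained (regress) x z = y, add(5) by(cluster)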
All our examples had the same two-level structure—observations within clusters. More-complex multilevel structures are an active research area; one recent paper describing imputation for multilevel models is Goldstein et al. (2009).