Solved – Questions on multiple imputation with MICE for a multigroup-SEM-analysis? (including survey weights)

Tags: mice, multiple-imputation, r, structural-equation-modeling, weighted-data

I am planning to do a multigroup SEM analysis. I gathered survey data and calculated a survey weight. Some of my variables have item nonresponse (mostly around 5% missing values).

I've decided to use multiple imputation to handle the missing data. First, I used the LittleMCAR() test to check the missingness mechanism. I also used TestMCARNormality() from Jamshidian et al., which includes a nonparametric test of MCAR based on homogeneity of covariances. The latter did not reject MCAR; the LittleMCAR() test did (p=8.3%). Because I assume my data to be MAR, I split the data into men and women and applied the LittleMCAR() test to each subgroup. This time MCAR was not rejected in either subgroup.

I've read (see: Enders, C., & Gottschall, A. (2011). Multiple Imputation Strategies for Multiple Group Structural Equation Models. Structural Equation Modeling: A Multidisciplinary Journal, 35-54.) that if I plan to do a multigroup SEM analysis, I should run a separate multiple imputation for each group (in this case: men/women). I will use the R package mice for the imputation.

Now my questions:

1.) Should I use the default "massive imputation" predictor matrix from mice,
predictorMatrix = (1 - diag(1, ncol(data))), which uses all variables in the dataset as predictors for the imputation model, or should I use quickpred() to generate a predictor matrix? quickpred() uses criteria such as the correlation between predictor and target variable to select a set of predictors for each variable that will be imputed.

quickpred(datensatz_gender_0, include=c("weight_trunc"),exclude=c("ID","X","gender"),mincor = 0.1)
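To make the structure of such a matrix concrete, here is a minimal base-R sketch that builds the full predictor matrix by hand and then switches off a few columns. The item names are hypothetical; only ID, X, gender and weight_trunc come from the call above. Rows are the variables to be imputed, columns are the candidate predictors.

```r
# Hypothetical column names; item1..item3 stand in for the survey items.
vars <- c("ID", "X", "gender", "weight_trunc", "item1", "item2", "item3")

# Full ("massive imputation") matrix: every variable predicts every other one.
pred <- 1 - diag(1, length(vars))
dimnames(pred) <- list(vars, vars)

# Switch off identifiers and the grouping variable as predictors
# (gender is constant within a group when imputing separately by group).
pred[, c("ID", "X", "gender")] <- 0

# The weight column stays in as a predictor for every other variable.
pred["item1", ]
```

A matrix of this shape is what the predictorMatrix argument of mice() expects, so it can be edited by hand before imputation instead of relying on an automatic selection.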

2.) Should I include the survey weight in the predictor matrix?

After imputation, the list of imputed datasets will be passed to the survey package (for weighting purposes). I will then use lavaan to specify my model, which will use the survey object built from the imputed data. This lavaan model will then be passed to lavaan.survey(), so that I can use the survey weights together with the imputed data. As far as I understand, lavaan.survey() will then pool the results…

It would be great if somebody could answer these questions.
Thank you!

Best Answer

(I'm the creator of lavaan.survey)

As Stas already indicated, the combination (multiple imputation * complex sampling) can be tricky business. The main papers are Kott (1995) and Kim, Brick & Fuller (2006).

Here are some considerations:

As mentioned by Stas, all the usual best practices of MI apply. Considering the below, I would probably not use quickpred() initially. There is a risk it will discard things that you actually need. It might help to make some reasonable subselection though.
If you have weights, these need to be included in the imputation model as a covariate (Kim et al. 2006, p. 518). Since you are doing multiple group analysis ("domain estimation"), you also need to include the interaction between the group dummies and the weights in the imputation model (p. 519).
If you have strata and clusters, things become more complicated. The imputation model needs to account for the resulting correlation between the observations. If it does not, you will get the wrong standard errors (Kim et al. 2006: p. 514). A model-based way of doing this might be to include strata as fixed effects and clusters as random effects in a Bayesian imputation model. A more survey-like approach would be to follow Stas' suggestion and use a resampling procedure that respects the strata and clusters. For example, when bootstrapping with just the clusters, you would sample clusters (PSUs) at random with replacement and then individuals (secondary sampling units, SSUs) with replacement within the sampled clusters.
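The weight-by-group interaction mentioned above can be added as an ordinary extra column before imputation. This is a minimal base-R sketch on toy data; all column names are hypothetical, and it only shows how the covariate is constructed, not a full imputation.

```r
# Toy data; all column names are hypothetical.
set.seed(1)
dat <- data.frame(
  gender = rep(c(0, 1), each = 4),        # group dummy
  weight = round(runif(8, 0.5, 2), 2),    # survey weight
  y      = c(3, NA, 4, 5, NA, 2, 3, 4)    # item with nonresponse
)

# Weight-by-group interaction, to be used as an additional
# covariate in the imputation model alongside weight and gender.
dat$weight_x_gender <- dat$weight * dat$gender

head(dat)
```

If the imputation is run separately for each group, the effect of the weight is already free to differ between groups, so the explicit product term should only matter when all groups are imputed jointly.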

Another advantage of Stas' resampling suggestion, even without strata and clusters, is that you will account for the uncertainty about the parameters of the imputation model including that caused by the weights. I am not sure if mice does this accurately by default. This is usually a relatively small additional term in the variance but it might make a difference.
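A two-stage bootstrap of the kind described here (clusters first, then individuals within the drawn clusters) could look roughly like this in base R. The data, the column names and the helper two_stage_boot() are made up for illustration; strata are ignored, and each bootstrap replicate would still have to be imputed and analysed.

```r
set.seed(42)
# Toy sample: 4 clusters (PSUs) with 3 individuals each; names are hypothetical.
dat <- data.frame(psu = rep(1:4, each = 3), id = 1:12, y = rnorm(12))

# Hypothetical helper: one bootstrap replicate that respects the clusters.
two_stage_boot <- function(data, cluster) {
  psus <- unique(data[[cluster]])
  # Stage 1: draw PSUs with replacement.
  drawn <- psus[sample.int(length(psus), length(psus), replace = TRUE)]
  # Stage 2: within each drawn PSU, draw individuals with replacement.
  parts <- lapply(seq_along(drawn), function(k) {
    rows <- which(data[[cluster]] == drawn[k])
    out <- data[rows[sample.int(length(rows), length(rows), replace = TRUE)], ]
    out$boot_psu <- k  # a PSU drawn twice counts as two separate clusters
    out
  })
  do.call(rbind, parts)
}

boot <- two_stage_boot(dat, "psu")
```

Repeating this many times, and imputing within each replicate, is what lets the variation in the imputation model's parameters show up in the final standard errors.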

Once you have the multiply imputed datasets, you can just pass these as an imputationList to lavaan.survey (see the JSS lavaan.survey paper). lavaan.survey will then do all the usual MI pooling calculations for you. So you don't need to manually fit a model separately for each imputation!
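For reference, the "usual MI pooling calculations" are Rubin's rules. The sketch below shows the textbook computation for a single parameter in base R with invented numbers; it is not lavaan.survey's internal code.

```r
# Hypothetical results for one parameter from m = 5 imputed datasets.
est <- c(0.42, 0.45, 0.40, 0.47, 0.44)   # point estimates
se  <- c(0.10, 0.11, 0.10, 0.12, 0.11)   # standard errors

m     <- length(est)
q_bar <- mean(est)                 # pooled point estimate
w_bar <- mean(se^2)                # within-imputation variance
b     <- var(est)                  # between-imputation variance
t_var <- w_bar + (1 + 1 / m) * b   # total variance

c(estimate = q_bar, se = sqrt(t_var))
```

The pooled standard error is always at least as large as the square root of the average within-imputation variance, because the between-imputation term adds the uncertainty due to the missing data.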

Hope this helps,

All the best, Daniel

P.S. Thanks to Stas and @Gaming_dude who brought this post to my attention. I would be happy to continue the conversation (here, lavaan Google discussion group, twitter, email..)!

Related Solutions

Solved – perform Random Forest AFTER multiple imputation with MICE

This is not a direct answer to your question, and I don't have enough reputation to comment, but one thing you can do is use the mlr (Machine Learning in R) package. It offers many random forest learner implementations that can handle data with missing values. You can also tune the learners to suit your dataset.

Links to the package and documentation are on the main tutorial page, here:

https://mlr-org.github.io/mlr-tutorial/release/html/index.html

Also, consider that answering your question becomes much easier if you provide a sample of your dataset.

If you need a direct answer, looping a series of RF calls on the imputed datasets might work. E.g. if you have five imputations:

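library(mice)   # complete()
library(party)  # cforest(); partykit offers one as well

# One column of test-set predictions per imputed dataset
res <- data.frame(matrix(0, nrow = nrow(test), ncol = 5))
for (i in 1:5) {
  data <- complete(miceResult, i)  # i-th completed dataset
  rf.res <- cforest([which formula?], data = data)  # formula goes first; fill in your model
  res[, i] <- predict(rf.res, newdata = test)
}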

Then you can pool the results by majority voting or averaging, depending on your dataset. You can also group the 5 imputations together and train the learner with the combined dataset. Both methods are suboptimal, however.
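As a rough illustration of the two pooling options, assuming a predictions table with one row per test case and one column per imputation (the values below are invented):

```r
# Hypothetical predictions: rows are test cases, columns are the 5 imputations.
res_num <- data.frame(i1 = c(1.0, 2.0), i2 = c(1.2, 2.1), i3 = c(0.8, 1.9),
                      i4 = c(1.1, 2.0), i5 = c(0.9, 2.0))
res_cls <- data.frame(i1 = c("a", "b"), i2 = c("a", "b"), i3 = c("b", "b"),
                      i4 = c("a", "a"), i5 = c("a", "b"),
                      stringsAsFactors = FALSE)

# Regression: average the predictions across imputations.
pooled_num <- rowMeans(res_num)

# Classification: majority vote across imputations
# (ties go to the label that sorts first).
pooled_cls <- apply(res_cls, 1, function(r) names(which.max(table(r))))
```

Averaging suits numeric outcomes, while the vote suits class labels. With an even number of imputations ties become possible, so an odd number is the safer choice for voting.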

Hope this helps.

Solved – Manipulating data for propensity score matching following multiple imputation with mice package

Your second method, using complete(imp, "long") is what you need to do. You've probably read the Mitra & Reiter paper distinguishing between the "Within" and "Across" methods, since you mentioned wanting to average propensity scores. With imputed outcome data, you need to use the "Within" approach, and then combine and adjust your estimates using Rubin's rules. So, for each imputed data set, you will do the following: run matchit(), and then run your outcome regression model on the matched data set. Once you have done this for each imputed data set, you will recombine the model fits using as.mira() and then summarize your mira output. Here is some sample code:

library(mice)     # complete(), as.mira(), pool()
library(MatchIt)  # matchit(), match.data()

imp.data <- complete(imp, "long")
imp.ids <- unique(imp.data$.imp)
fit.list <- setNames(vector("list", length(imp.ids)), imp.ids)
for (i in imp.ids) {
  # Match within the i-th imputed data set, then fit the outcome model on the matched data
  m.out <- matchit(t ~ v1 + v2 + v3, data = imp.data[imp.data$.imp == i, ])
  fit.list[[as.character(i)]] <- glm(y ~ t, data = match.data(m.out))
}
fit.list.mira <- as.mira(fit.list)  # combines into mira object for pool()
summary(pool(fit.list.mira))

Hope this helps!
