Survey Analysis – Combining Multiple Imputation and Survey Non-Response Adjustments (IPW)

epidemiology, multiple-imputation, non-response, propensity-scores, survey

Imagine the following scenario:
A population cohort (assume no or equal sampling weights) of, say, 10,000 people had various demographic and health factors measured at baseline, $X_{base}$ (with some missing data). At a later timepoint all cohort members were sent a follow-up questionnaire, which included further factors $X_{fu}$ and a binary outcome measure of a certain health indicator, $Y$. Let's say there was 50% non-response to this later questionnaire.

I want to impute missing baseline data (assumed MCAR/MAR) using multiple imputation (R package mice) and then adjust for non-response bias using a propensity-score/IPW based method (logistic regression of response $R=1$ vs no-response $R=0$) as there may be a pattern of non-response associated with a certain combination of the demographic/health factors in $X_{base}$.
You can assume we ultimately aim to fit a further logistic regression to investigate associations between $Y$ and factors $X_{base}$. This general question was considered by Seaman et al. (2012) and if my reading is correct has been used elsewhere in the literature.
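
For concreteness, the non-response adjustment described above can be written as follows. This is only a sketch under assumed notation: $\hat\pi_i$, $\hat\alpha$ and $w_i$ are symbols introduced here, and the response model is taken to be a main-effects logistic regression on $X_{base}$.

$$
\hat\pi_i = \widehat{\Pr}(R_i = 1 \mid X_{base,i}) = \frac{1}{1 + \exp\{-(\hat\alpha_0 + \hat\alpha^\top X_{base,i})\}}, \qquad w_i = \frac{1}{\hat\pi_i} \quad \text{for } R_i = 1 .
$$

Responders who resemble non-responders (small $\hat\pi_i$) are then up-weighted in the outcome model.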

The R package MatchThem for weighting multiply imputed datasets would seem perfect if I were looking at a treatment/exposure vs an outcome. However, I do not understand whether I could also use it in the above scenario. Whether with this package or otherwise, any help on how to implement the combination of the MI and IPW steps would be great.

Some of my concerns are:

  1. Should I impute the missing $Y$s together with the $X$s? It might help the subsequent IPW because it would keep more of the non-responders' baseline data ($X_{base}$ for those with $R=0$) in the analysis.
  2. I think the final logistic regression for associations should only be done on those responding ($R=1$), regardless of any imputed $Y$, as recommended by Paul von Hippel? That makes sense to me, as otherwise I could just MI all $X$ and $Y$ and not have to worry about the non-response or IPW.

Example R code for a fake cohort in case it helps with any answers

N_cohort = 10000L
set.seed(235)
Response = rep(c(1L,0L), each=N_cohort/2)
Y = c(sample(0:1, N_cohort/2, replace = TRUE, prob = runif(2)), rep(NA, N_cohort/2))
sex = gl(2, 1, N_cohort, labels = c("F", "M"))
age = round(rweibull(N_cohort, shape= 10, scale=55))
edu = factor(sample(letters[1:5], size = N_cohort, replace = TRUE, prob = c(0.01, 0.05, 0.2, 0.3, 0.44)))
inc = factor(sample(c(seq(0,1e5,2e4), NA), N_cohort, replace = TRUE, prob = c(0.03, 0.07, 0.26, 0.28, 0.14, 0.02, 0.2)))
A = round(rweibull(N_cohort, shape= 1, scale=5))
B = rnorm(N_cohort)
C = rnorm(N_cohort, 24, 3)
D = ifelse(is.na(Y), NA, runif(N_cohort/2, 0, 20))

df = data.frame(Response, Y, sex, age, inc, edu, A, B, C, D)
df[,6:10] = as.data.frame(lapply(df[,6:10], function(.x) .x[ sample(c(TRUE, NA), prob = c(0.85, 0.03), size = length(.x), replace = TRUE)]))

Best Answer

You have done your research and I would refer to the same sources as you. I'm a fan of multiple imputation over IPW for missing data, including outcomes, but the "double-robustness" of imputation for covariates and IPW for outcomes is appealing as well. One downside of any IPW approach is that those with missing outcomes are removed from the analysis.

Interestingly, Seaman et al. (2012) don't even consider the analysis I would have done, which is to impute $X$ and $Y$ together, estimate weights for missingness in $Y$, and discard all units with missing $Y$ in the original dataset (i.e., discarding the imputed $Y$ values, which only exist to use to impute the missing $X$); they call this MI/IPW. With so much missingness in $Y$, a lot would depend on the correct imputation model, so having a missingness model in addition (i.e., using IPW) would seem beneficial to me. Discarding the units with missing $Y$ after imputation is also recommended by Kontopantelis et al. (2017). This strategy allows you to use $Y$ to impute $X$ but not rely on shaky imputed estimates of $Y$ for your final model.

Although you can probably use MatchThem, I think it would be better to just estimate the weights in each imputed dataset manually. If using the approach I mentioned, you only need weights for those with non-missing $Y$.
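
Written out, this strategy amounts to solving a weighted score equation within each imputed dataset. The following is a sketch under assumed notation, not taken from the cited papers: $m$ indexes imputations, $\tilde X_i^{(m)}$ is the completed covariate vector (including a 1 for the intercept), $R_i = 1$ marks an observed $Y_i$, and a logistic outcome model is assumed, as in the question.

$$
\sum_{i:\,R_i = 1} w_i^{(m)}\, \tilde X_i^{(m)} \left\{ Y_i - \operatorname{expit}\!\left(\beta^\top \tilde X_i^{(m)}\right) \right\} = 0, \qquad w_i^{(m)} = \frac{1}{\hat\pi_i^{(m)}},
$$

where $\hat\pi_i^{(m)}$ is the fitted probability that $Y_i$ is observed, estimated from the $m$-th completed dataset, and $\operatorname{expit}(u) = 1/(1+e^{-u})$. Only observed $Y_i$ enter the sum; the imputed $Y$ values serve solely to impute $X$.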

Below is a sketch of what this might look like in R.

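# Impute the covariates, with Y included in the imputation model
imp <- mice::mice(data, m = 20)

# For each imputed dataset, estimate response weights and keep only units with observed Y
new_data <- lapply(1:20, function(i) {
  di <- mice::complete(imp, i)
  # Indicator of an observed (not imputed) Y in the original data
  di$nonmissY <- !is.na(data$Y)
  
  # Model for the probability that Y is observed, fit on the completed covariates
  ps_fit <- glm(nonmissY ~ x1 + x2 + x3, data = di, family = binomial)
  
  # Drop units with missing Y; weight the rest by the inverse probability of response
  di_nonmissY <- subset(di, nonmissY)
  di_nonmissY$weights <- 1/(fitted(ps_fit)[di$nonmissY])
  di_nonmissY
})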

#Check balance between original and weighted nonmissing cases
bal_data <- purrr::map_dfr(1:20, function(i) {
  di <- mice::complete(imp, i)
  di$sample <- "full"
  di$weights <- 1
  new_data[[i]]$sample <- "nonmissing"
  
  dplyr::bind_rows(di, new_data[[i]])
}, .id = ".imp")

cobalt::bal.tab(sample ~ Y + x1 + x2 + x3, data = bal_data,
                weights = "weights", imp = ".imp")

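#Estimate and combine effects
#(lm() gives a linear probability model for binary Y; for the logistic model
# in the question, glm(..., family = quasibinomial) could be swapped in)
fits <- lapply(new_data, function(di) {
  lm(Y ~ x1 + x2 + x3, data = di, weights = weights)
})

betas <- mitools::MIextract(fits, fun = coef)
#Need robust SE for weights
vars <- mitools::MIextract(fits, fun = sandwich::vcovHC)
summary(mitools::MIcombine(betas, vars))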
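
The last line pools the per-imputation results using Rubin's rules, which can be sketched as below. The symbols are notation assumed here: $M$ is the number of imputations (20 in the sketch), $\hat\beta^{(m)}$ the coefficient vector and $\hat U^{(m)}$ the robust covariance matrix from imputation $m$.

$$
\bar\beta = \frac{1}{M}\sum_{m=1}^{M} \hat\beta^{(m)}, \qquad
\bar U = \frac{1}{M}\sum_{m=1}^{M} \hat U^{(m)}, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M} \left(\hat\beta^{(m)} - \bar\beta\right)\left(\hat\beta^{(m)} - \bar\beta\right)^\top, \qquad
T = \bar U + \left(1 + \frac{1}{M}\right) B .
$$

Here $\bar\beta$ is the pooled estimate and $T$ its total covariance, combining the average within-imputation variance $\bar U$ with the between-imputation variance $B$.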