Solved – Predictive Mean Matching as Single Imputation

data-imputationmissing datamultiple-imputationrregression-strategies

Multiple imputation is known to be advantageous compared to single imputation. However, in practice there are often non-statistical reasons why multiple imputation can not be used (e.g. the data recipient isn't educated enough to deal with multiple data sets; the boss of a company doesn't want it; …).

Predictive mean matching (PMM) is a bayesian imputation method that is known to be one of the best imputation methods, when multiple imputation can be applied. Now, I am wondering whether PMM could also be used as single imputation method, when multiple imputation is not possible.

Consider the following example:

  • You have a typical data set including a continous variable (e.g. income / age) with missing values.
  • The data can be imputed only once (single imputation).
  • You have to choose between typical single imputation methods (e.g. stochastic regression imputation / hot deck imputation) OR predictive mean matching.

Illustration in R:

# Create some example data
set.seed(95159)                      # Seed
N <- 20000                           # Sample size

y <- rnorm(N)                        # Target variable
x1 <- rnorm(N) + 0.1 * y             # Some auxiliary variables
x2 <- rnorm(N) - 0.05 * y
x3 <- rnorm(N) + 0.2 * x1 + 0.3 * x2

y[rbinom(N, 1, 0.1) == 1] <- NA      # 10% missings in y

data <- data.frame(y, x1, x2, x3)    # Create data set


# Impute data via stochastic regression imputation
# and predictive mean matching
library("mice")                      # Load mice package

imp_sri <- mice(data, m = 1, method = "norm.nob")
imp_pmm <- mice(data, m = 1, method = "pmm")

data_sri <- complete(imp_sri)
data_pmm <- complete(imp_pmm)


# Compare results
plot(density(data$y, na.rm = TRUE), xlab = "y",
     main = "Comparison Observed & Imputed)

points(density(data_sri$y[is.na(data$y)]), typ = "l", col = 2)
points(density(data_pmm$y[is.na(data$y)]), typ = "l", col = 3)

legend("topleft",
       c("Observed", "Stochastic Regression", "Predictive Mean Matching"),
       col = c("black", "red", "green"), lty = 1)

enter image description here

The simplified example doesn't reveal a substantial difference between stochastic regression imputation and single predictive mean matching. However, in more complex data scenarios (e.g. heteroscedastic data), predictive mean matching is usually advantageous.

Question: If multiple imputation can not be used, is there any reason, why single predictive mean matching should not be used instead of typical single imputation methods such as stochastic regression imputation?

Best Answer

The two approaches should be equally good or bad. The bigger point is that a boss who wants you do dumb down an analysis is not respecting the statistician for her area of expertise. Single imputation, whether regression or PMM can be used as you did by taking only one draw, and the resulting fitted equation will be nearly unbiased. But it will have higher variance and we'll have no way of getting standard errors on the coefficient estimates. So inform your boss of what exactly the trade-offs are.

Related Question