Solved – Multiple imputation for missing values

missing datamultiple-imputationrspss

I would like to use imputation for replacing missing values in my data set under certain constraints.

For example, I'd like the imputed variable x1 to be greater or equal to the sum of my two other variables, say x2 and x3. I also want x3 to be imputed by either 0 or >= 14 and I want x2 to be imputed by either 0 or >= 16.

I tried defining these constraints in SPSS for multiple imputation, but in SPSS I can only define maximum and minimum values. Is there any way to define further constraints in SPSS or do you know any R package which would let me define such constraints for imputation of missing values?

My data is as follows:

   x1 =c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA,NA,NA, 50, 52, 26, 24)
   x2 = c(0, NA, 18,0, 19, 0, NA, 0, 0, 0, 0, 0, 0,NA,NA, NA, 22, NA, 0, 0)
   x3 = c(0, 0, 0, 0, 0, 54, 0 ,0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0)
   dat=data.frame(x1=x1, x2=x2, x3=x3)
   > dat
       x1 x2 x3
   1   21  0  0
   2   50 NA  0
   3   31 18  0
   4   15  0  0
   5   36 19  0
   6   82  0 54
   7   14 NA  0
   8   14  0  0
   9   19  0  0
   10  18  0  0
   11  16  0  0
   12  36  0  0
   13 583  0  0
   14  NA NA NA
   15  NA NA NA
   16  NA NA NA
   17  50 22 NA
   18  52 NA  0
   19  26  0  0
   20  24  0  0

Best Answer

One solution is to write your own custom imputation functions for the mice package. The package is prepared for this and the setup surprisingly pain-free.

First we setup the data as suggested:

dat=data.frame(x1=c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA,NA,NA, 50, 52, 26, 24), 
               x2=c(0, NA, 18,0, 19, 0, NA, 0, 0, 0, 0, 0, 0,NA,NA, NA, 22, NA, 0, 0), 
               x3=c(0, 0, 0, 0, 0, 54, 0 ,0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0))

Next we load the micepackage and see what methods it choose by default:

library(mice)
# Do a non-imputation
imp_base <- mice(dat, m=0, maxit = 0)

# Find the methods that mice chooses
imp_base$method
# Returns: "pmm" "pmm" "pmm"

# Look at the imputation matrix
imp_base$predictorMatrix
# Returns:
#   x1 x2 x3
#x1  0  1  1
#x2  1  0  1
#x3  1  1  0

The pmm stands for predictive mean matching - probably the most popular imputation algorithm for imputing continuous variables. It calculates the predicted value using a regression model and picks the 5 closest elements to the predicted value (by Euclidean distance). These chosen elements are called the donor pool and the final value is chosen at random from this donor pool.

From the prediction matrix we find that the methods get the variables passed that are of interest for the restrictions. Note that the row is the target variable and the column the predictors. If x1 did not have 1 in the x3 column we would have to add this in the matrix: imp_base$predictorMatrix["x1","x3"] <- 1

Now to the fun part, generating the imputation methods. I've chosen a rather crude method here where I discard all values if they don't meet the criteria. This may result in long loop time and it may potentially be more efficient to keep the valid imputations and only redo the remaining ones, it would require a little more tweaking though.

# Generate our custom methods
mice.impute.pmm_x1 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    max_sum <- sum(max(x[,"x2"], na.rm=TRUE),
                   max(x[,"x3"], na.rm=TRUE))
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                              version = "", ...)
      if (all(vals < max_sum)){
        break
      }
    }
    return(vals)
  }

mice.impute.pmm_x2 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                              version = "", ...)
      if (all(vals == 0 | vals >= 14)){
        break
      }
    }
    return(vals)
  }

mice.impute.pmm_x3 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                              version = "", ...)
      if (all(vals == 0 | vals >= 16)){
        break
      }
    }
    return(vals)
  }

Once we are done defining the methods we simple change the previous methods. If you only want to change a single variable then you can simply use imp_base$method["x2"] <- "pmm_x2" but for this example we will change all (the naming is not necessary):

imp_base$method <- c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3")

# The predictor matrix is not really necessary for this example
# but I use it just to illustrate in case you would like to 
# modify it
imp_ds <- 
  mice(dat, 
       method = imp_base$method, 
       predictorMatrix = imp_base$predictorMatrix)

Now let's have a look at the third imputed dataset:

> complete(imp_ds, action = 3)
    x1 x2 x3
1   21  0  0
2   50 19  0
3   31 18  0
4   15  0  0
5   36 19  0
6   82  0 54
7   14  0  0
8   14  0  0
9   19  0  0
10  18  0  0
11  16  0  0
12  36  0  0
13 583  0  0
14  50 22  0
15  52 19  0
16  14  0  0
17  50 22  0
18  52  0  0
19  26  0  0
20  24  0  0

Ok, that does the job. I like this solution as you can piggyback on top of mainstream functions and just add the restrictions that you find meaningful.

Update

In order to enforce the rigorous restraints @t0x1n mentioned in the comments, we may want to add the following abilities to the wrapper function:

  1. Save valid values during the loops so that data from previous, partially successful runs is not discarded
  2. An escape mechanism in order to avoid infinite loops
  3. Inflate the donor pool after trying x times without finding a suitable match (this primarily applies to pmm)

This results in a slightly more complicated wrapper function:

mice.impute.pmm_x1_adv <-   function (y, ry, 
                                      x, donors = 5, 
                                      type = 1, ridge = 1e-05, 
                                      version = "", ...) {
  # The mice:::remove.lindep may remove the parts required for
  # the test - in those cases we should escape the test
  if (!all(c("x2", "x3") %in% colnames(x))){
    warning("Could not enforce pmm_x1 due to missing column(s):",
            c("x2", "x3")[!c("x2", "x3") %in% colnames(x)])
    return(mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                           version = "", ...))
  }

  # Select those missing
  max_vals <- rowSums(x[!ry, c("x2", "x3")])

  # We will keep saving the valid values in the valid_vals
  valid_vals <- rep(NA, length.out = sum(!ry))
  # We need a counter in order to avoid an eternal loop
  # and for inflating the donor pool if no match is found
  cntr <- 0
  repeat{
    # We should be prepared to increase the donor pool, otherwise
    # the criteria may become imposs
    donor_inflation <- floor(cntr/10)
    vals <- mice.impute.pmm(y, ry, x, 
                            donors = min(5 + donor_inflation, sum(ry)), 
                            type = 1, ridge = 1e-05,
                            version = "", ...)

    # Our criteria check
    correct <- vals < max_vals
    if (all(!is.na(valid_vals) |
              correct)){
      valid_vals[correct] <-
        vals[correct]
      break
    }else if (any(is.na(valid_vals) &
                    correct)){
      # Save the new valid values
      valid_vals[correct] <-
        vals[correct]
    }

    # An emergency exit to avoid endless loop
    cntr <- cntr + 1
    if (cntr > 200){
      warning("Could not completely enforce constraints for ",
              sum(is.na(valid_vals)),
              " out of ",
              length(valid_vals),
              " missing elements")
      if (all(is.na(valid_vals))){
        valid_vals <- vals
      }else{
        valid_vals[is.na(valid_vals)] <- 
          vals[is.na(valid_vals)]
      }
      break
    }
  }
  return(valid_vals)
}

Note that this does not perform that well, most likely due to that the suggested data set fails the constraints for all cases without missing. I need to increase the loop length to 400-500 before it even starts to behave. I assume that this is unintentional, your imputation should mimic how the actual data is generated.

Optimization

The argument ry contains the non-missing values and we could possibly speed up the loop by removing the elements that we have found eligible imputations, but as I'm unfamiliar with the inner functions I have refrained from this.

I think the most important thing when you have strong constraints that take time to full-fill is to parallelize your imputations (see my answer on CrossValidated). Most have today computers with 4-8 cores and R only uses one of them by default. The time can be (almost) sliced in half by doubling the number of cores.

Missing parameters at imputation

Regarding the problem of x2 being missing at the time of imputation - mice actually never feeds missing values into the x-data.frame. The mice method includes filling in some random value at start. The chain-part of the imputation limits the impact from this initial value. If you look at the mice-function you can find this prior to the imputation call (the mice:::sampler-function):

...
if (method[j] != "") {
  for (i in 1:m) {
    if (nmis[j] < nrow(data)) {
      if (is.null(data.init)) {
        imp[[j]][, i] <- mice.impute.sample(y, 
                                            ry, ...)
      }
      else {
        imp[[j]][, i] <- data.init[!ry, j]
      }
    }
    else imp[[j]][, i] <- rnorm(nrow(data))
  }
}
...

The data.init can be supplied to the mice function and the mice.imput.sample is a basic sampling procedure.

Visiting sequence

If visiting sequence is important you can specify the order in which the mice-function runs the imputations. Default is from 1:ncol(data) but you can set the visitSequence to be anything you like.

Related Question