Solved – Multiple Imputation and “Conditional Missing” values

mice, multiple-imputation, r

I am a lonely peon currently researching and playing around with multiple imputation. I am using MICE in R to impute random missing data; however, I run into a problem when attempting to account for conditional or structured NAs in a dataset.

I'll provide a simplistic dataset in an attempt to illustrate my meaning:

TestData <- data.frame(Condition= c(1,1,1,1,2,NA,2,2), 
Dependent1=c(1,NA,2,3,NA,NA,NA,NA),
Dependent2=c(1,12,44,1,NA,NA,NA,NA),
Dependent3=c(NA,2,3,5,NA,NA,NA,NA), 
UnaffiliatedQ=c(1,NA,3,2,27,NA,32,35))

TestData$Condition <- factor(TestData$Condition,
                     levels = c(1,2),
                     labels = c("Yes","No"))

In my example, the variable "Condition" is a gatekeeper question that determines whether a respondent needs to answer the next three questions (Dependent1 to Dependent3). If a respondent answers "No", he or she never sees the next three questions, and they are recorded as NAs. These values are not technically missing; they are structurally not applicable.

I've come to ask what CV would do in this type of situation. If I impute the NA value in the Condition variable, along with those in Dependent1, Dependent2, and Dependent3, how would I ensure that I don't end up with values in the Dependent variables that violate the constraint (for example, imputed answers for a respondent whose Condition is "No")?

I've thought of possible solutions, but none that I think would be valid or a good idea (e.g., coding a structural missing value such as -999, or subsetting the data frame based on the conditional answers). I've looked around for literature or walkthroughs for situations like this; however, I've come up empty-handed.

The other alternative is that I've simply been running down the rabbit hole of multiple imputation and this is not the correct use of it. I posted on SO a few days ago, and was told to post here 🙂

I appreciate your thoughts and help all!

Best Answer

I suggest imputing in two steps: impute Condition first, and then impute Dependent1-3 for the subset with Condition = Yes.

# I extended your data a bit, to get more cases.
TestData <- data.frame(Condition= c(1,1,1,1,2,NA,2,2, sample(1:2, 500, replace = TRUE)), 
                   Dependent1=c(1,NA,2,3,NA,NA,NA,NA, sample(1:3, 500, replace = TRUE)),
                   Dependent2=c(1,12,44,1,NA,NA,NA,NA, sample(1:44, 500, replace = TRUE)),
                   Dependent3=c(NA,2,3,5,NA,NA,NA,NA, sample(1:5, 500, replace = TRUE)), 
                   UnaffiliatedQ=c(1,NA,3,2,27,NA,32,35, sample(1:35, 500, replace = TRUE)))
TestData[10:25, 1:4] <- NA

TestData$Condition <- factor(TestData$Condition,
                         levels = c(1,2),
                         labels = c("Yes","No"))



library("mice")

# Step 1: Imputation of condition
imp <- mice(TestData[ , c(1, 5)], m = 1)
data_imp <- complete(imp)

TestData$Condition <- as.factor(as.character(data_imp$Condition))
TestData$UnaffiliatedQ <- as.numeric(data_imp$UnaffiliatedQ)

# Subset of TestData (Condition == Yes)
TestData_sub <- TestData[TestData$Condition == "Yes", ]

# Step 2: Imputation of Dependent1-3
imp_2 <- mice(TestData_sub, m = 1)
data_imp_2 <- complete(imp_2)

TestData$Dependent1[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent1)
TestData$Dependent2[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent2)
TestData$Dependent3[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent3)
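
If the "No" rows are meant to stay structurally not applicable, one option is to blank the dependent columns for those rows explicitly once the imputation is done. The following is a minimal base-R sketch of that idea, not part of the answer above: `enforce_skip_pattern` is a hypothetical helper name (it is not a mice function), and the small data frame is made up to stand in for a completed data set.

```r
# Hypothetical helper (not part of mice): reset follow-up answers to NA
# for every respondent whose gatekeeper answer means they never saw them.
enforce_skip_pattern <- function(data, gate, dependents, skip_level = "No") {
  skipped <- !is.na(data[[gate]]) & data[[gate]] == skip_level
  data[skipped, dependents] <- NA
  data
}

# Toy "completed" data set, made up for illustration only
completed <- data.frame(
  Condition  = factor(c("Yes", "No", "Yes", "No")),
  Dependent1 = c(1, 2, 3, 1),
  Dependent2 = c(12, 44, 1, 7)
)

cleaned <- enforce_skip_pattern(completed, "Condition",
                                c("Dependent1", "Dependent2"))
print(cleaned)
```

Rows with Condition == "Yes" are left untouched, so a step like this could be run on each completed data set before analysis.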

Related Solutions

Solved – Multiple imputation for missing values

One solution is to write your own custom imputation functions for the mice package. The package is prepared for this and the setup surprisingly pain-free.

First we setup the data as suggested:

dat=data.frame(x1=c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA,NA,NA, 50, 52, 26, 24), 
               x2=c(0, NA, 18,0, 19, 0, NA, 0, 0, 0, 0, 0, 0,NA,NA, NA, 22, NA, 0, 0), 
               x3=c(0, 0, 0, 0, 0, 54, 0 ,0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0))

Next we load the mice package and see which methods it chooses by default:

library(mice)
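# Do a dry run (maxit = 0): mice sets everything up without iterating.
# Recent mice versions reject m = 0, so m is left at its default.
imp_base <- mice(dat, maxit = 0)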

# Find the methods that mice chooses
imp_base$method
# Returns: "pmm" "pmm" "pmm"

# Look at the predictor matrix
imp_base$predictorMatrix
# Returns:
#   x1 x2 x3
#x1  0  1  1
#x2  1  0  1
#x3  1  1  0

pmm stands for predictive mean matching, probably the most popular algorithm for imputing continuous variables. For each missing entry it computes a predicted value from a regression model and picks the 5 observed cases whose predicted values are closest to it (by absolute difference). These cases form the donor pool, and the observed value of one donor, chosen at random from the pool, is used as the imputation.
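
To make the matching step concrete, here is a simplified base-R sketch of the idea for a single variable. It is an illustration under assumptions, not the mice implementation: `pmm_sketch` is a made-up name, it uses a plain `lm()` fit, and it skips the random draw of the regression parameters that mice adds on top.

```r
# Simplified predictive mean matching for a numeric y with one complete
# predictor x. Assumption: plain lm() fit, no parameter uncertainty.
pmm_sketch <- function(y, x, donors = 5) {
  observed <- !is.na(y)
  dat <- data.frame(y = y, x = x)
  fit <- lm(y ~ x, data = dat[observed, ])

  pred_obs <- predict(fit, newdata = dat[observed, ])
  pred_mis <- predict(fit, newdata = dat[!observed, ])
  y_obs <- y[observed]

  imputed <- vapply(pred_mis, function(p) {
    # Donor pool: observed cases whose predictions are closest to p
    pool <- order(abs(pred_obs - p))[seq_len(min(donors, length(y_obs)))]
    # Borrow the observed value of one randomly chosen donor
    y_obs[pool[sample.int(length(pool), 1)]]
  }, numeric(1))

  y[!observed] <- imputed
  y
}

set.seed(1)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2, 4, NA, 8, 11, NA, 14, 15, 19, 20)
y_imp <- pmm_sketch(y, x)
print(y_imp)
```

Because every imputed value is copied from an observed case, pmm can never produce a value outside the observed data, which is why it combines well with range-type restrictions.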

From the predictor matrix we can see that each method is passed the variables needed for the restrictions. Note that each row is a target variable and each column a predictor. If x1 did not have a 1 in the x3 column, we would have to add it to the matrix: imp_base$predictorMatrix["x1","x3"] <- 1

Now to the fun part: writing the imputation methods. I've chosen a rather crude approach here, where I discard all values if any of them fails the criteria. This may result in long loops; it would probably be more efficient to keep the valid imputations and redo only the remaining ones, but that requires a little more tweaking.
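
The row/column convention can be checked on a plain matrix without running mice at all. In this small sketch the matrix is built by hand to mirror the one printed above, and `predictors_of` is a hypothetical helper, not a mice function.

```r
# Hand-built copy of the predictor matrix shown above (rows = targets,
# columns = predictors); built manually here, not taken from mice.
vars <- c("x1", "x2", "x3")
pred <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
diag(pred) <- 0

# Pretend x3 had been dropped as a predictor of x1, then switch it back on
pred["x1", "x3"] <- 0
pred["x1", "x3"] <- 1

# Hypothetical helper: which variables are used to impute a given target?
predictors_of <- function(pm, target) colnames(pm)[pm[target, ] == 1]
print(predictors_of(pred, "x1"))
```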

# Generate our custom methods
mice.impute.pmm_x1 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    max_sum <- sum(max(x[,"x2"], na.rm=TRUE),
                   max(x[,"x3"], na.rm=TRUE))
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                              version = "", ...)
      if (all(vals < max_sum)){
        break
      }
    }
    return(vals)
  }

mice.impute.pmm_x2 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                              version = "", ...)
      if (all(vals == 0 | vals >= 14)){
        break
      }
    }
    return(vals)
  }

mice.impute.pmm_x3 <- 
  function (y, ry, x, donors = 5, type = 1, ridge = 1e-05, version = "", 
            ...) 
  {
    repeat{
      vals <- mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                              version = "", ...)
      if (all(vals == 0 | vals >= 16)){
        break
      }
    }
    return(vals)
  }

Once we are done defining the methods, we simply replace the previous methods. If you only want to change a single variable, you can use imp_base$method["x2"] <- "pmm_x2", but for this example we will change all of them (the naming is not necessary):

imp_base$method <- c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3")

# The predictor matrix is not really necessary for this example
# but I use it just to illustrate in case you would like to 
# modify it
imp_ds <- 
  mice(dat, 
       method = imp_base$method, 
       predictorMatrix = imp_base$predictorMatrix)

Now let's have a look at the third imputed dataset:

> complete(imp_ds, action = 3)
    x1 x2 x3
1   21  0  0
2   50 19  0
3   31 18  0
4   15  0  0
5   36 19  0
6   82  0 54
7   14  0  0
8   14  0  0
9   19  0  0
10  18  0  0
11  16  0  0
12  36  0  0
13 583  0  0
14  50 22  0
15  52 19  0
16  14  0  0
17  50 22  0
18  52  0  0
19  26  0  0
20  24  0  0

Ok, that does the job. I like this solution as you can piggyback on top of mainstream functions and just add the restrictions that you find meaningful.

Update

In order to enforce the rigorous constraints @t0x1n mentioned in the comments, we may want to add the following abilities to the wrapper function:

- Save valid values during the loops, so that data from previous, partially successful runs is not discarded.
- An escape mechanism, in order to avoid infinite loops.
- Inflate the donor pool after trying x times without finding a suitable match (this primarily applies to pmm).

This results in a slightly more complicated wrapper function:

mice.impute.pmm_x1_adv <-   function (y, ry, 
                                      x, donors = 5, 
                                      type = 1, ridge = 1e-05, 
                                      version = "", ...) {
  # The mice:::remove.lindep may remove the parts required for
  # the test - in those cases we should escape the test
  if (!all(c("x2", "x3") %in% colnames(x))){
    warning("Could not enforce pmm_x1 due to missing column(s): ",
            paste(setdiff(c("x2", "x3"), colnames(x)), collapse = ", "))
    return(mice.impute.pmm(y, ry, x, donors = 5, type = 1, ridge = 1e-05,
                           version = "", ...))
  }

  # Select those missing
  max_vals <- rowSums(x[!ry, c("x2", "x3")])

  # We will keep saving the valid values in the valid_vals
  valid_vals <- rep(NA, length.out = sum(!ry))
  # We need a counter in order to avoid an eternal loop
  # and for inflating the donor pool if no match is found
  cntr <- 0
  repeat{
    # We should be prepared to increase the donor pool, otherwise
    # the criteria may become impossible to meet
    donor_inflation <- floor(cntr/10)
    vals <- mice.impute.pmm(y, ry, x, 
                            donors = min(5 + donor_inflation, sum(ry)), 
                            type = 1, ridge = 1e-05,
                            version = "", ...)

    # Our criteria check
    correct <- vals < max_vals
    if (all(!is.na(valid_vals) |
              correct)){
      valid_vals[correct] <-
        vals[correct]
      break
    }else if (any(is.na(valid_vals) &
                    correct)){
      # Save the new valid values
      valid_vals[correct] <-
        vals[correct]
    }

    # An emergency exit to avoid endless loop
    cntr <- cntr + 1
    if (cntr > 200){
      warning("Could not completely enforce constraints for ",
              sum(is.na(valid_vals)),
              " out of ",
              length(valid_vals),
              " missing elements")
      if (all(is.na(valid_vals))){
        valid_vals <- vals
      }else{
        valid_vals[is.na(valid_vals)] <- 
          vals[is.na(valid_vals)]
      }
      break
    }
  }
  return(valid_vals)
}

Note that this does not perform very well, most likely because the suggested data set violates the constraints in all of its complete cases. I needed to increase the loop limit to 400-500 before it even started to behave. I assume this is unintentional; your imputation should mimic how the actual data are generated.

Optimization

The argument ry is a logical vector that marks which values of y are observed. We could possibly speed up the loop by dropping the elements for which we have already found eligible imputations, but as I'm unfamiliar with the inner functions I have refrained from this.

I think the most important thing when you have strong constraints that take time to fulfill is to parallelize your imputations (see my answer on CrossValidated). Most computers today have 4-8 cores, and R only uses one of them by default. The time can be (almost) cut in half by doubling the number of cores.
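
As a rough sketch of what parallelizing could look like, the example below spreads independent runs over cores with the base `parallel` package. The function `impute_once` is a hypothetical stand-in for a single imputation run (in practice it might call mice with m = 1 and the given seed); the toy vector inside it is made up.

```r
library(parallel)

# Hypothetical stand-in for one imputation run. A real version might call
# mice(dat, m = 1, seed = seed) and return the completed data.
impute_once <- function(seed) {
  set.seed(seed)
  y <- c(3, NA, 5, 8, NA)
  obs <- y[!is.na(y)]
  y[is.na(y)] <- obs[sample.int(length(obs), sum(is.na(y)), replace = TRUE)]
  y
}

seeds <- 1:4
# mclapply() forks, which is not available on Windows, so fall back to 1 core
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
runs <- mclapply(seeds, impute_once, mc.cores = n_cores)
print(length(runs))
```

Each run gets its own seed so the imputations differ from one another while remaining reproducible.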

Missing parameters at imputation

Regarding the problem of x2 being missing at the time of imputation: mice never actually feeds missing values into the x data frame. The mice algorithm starts by filling in random values, and the chained part of the imputation limits the impact of these initial values. If you look at the mice function, you can find this just before the imputation call (the mice:::sampler function):

...
if (method[j] != "") {
  for (i in 1:m) {
    if (nmis[j] < nrow(data)) {
      if (is.null(data.init)) {
        imp[[j]][, i] <- mice.impute.sample(y, 
                                            ry, ...)
      }
      else {
        imp[[j]][, i] <- data.init[!ry, j]
      }
    }
    else imp[[j]][, i] <- rnorm(nrow(data))
  }
}
...

The data.init can be supplied to the mice function, and mice.impute.sample is a basic sampling procedure.
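
The starting fill can be pictured as drawing the missing entries at random from the observed ones. A minimal base-R sketch of that idea follows; `init_by_sampling` is a made-up name and this is not the mice source code.

```r
# Fill missing entries with random draws from the observed values.
# Made-up helper that illustrates the idea only.
init_by_sampling <- function(y) {
  ry <- !is.na(y)                      # TRUE where y is observed
  obs <- y[ry]
  # Index-based draw avoids sample()'s special case for a single number
  y[!ry] <- obs[sample.int(length(obs), sum(!ry), replace = TRUE)]
  y
}

set.seed(42)
x2 <- c(0, NA, 18, 0, 19, 0, NA, 0)
x2_start <- init_by_sampling(x2)
print(x2_start)
```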

Visiting sequence

If the visiting sequence is important, you can specify the order in which the mice function runs the imputations. The default is 1:ncol(data), but you can set visitSequence to anything you like.

Solved – How to choose which imputation to use to replace missing values

The way mice works, the goal is NOT to choose the best imputation (in your case, out of the 5 you have performed above) for replacing the NA values in your variable.

Rather, you choose an appropriate number of imputations and iterations and then compute a pooled value. That pooled value could be, for example, a pooled coefficient from a regression model fitted on each of the m imputed data sets, or even the average of the m imputations as a replacement for the NA in your variable.

I think the links below will give you the code and the understanding of the steps you need:

Code: http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/mi.html
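
For a single coefficient, pooling is usually done with Rubin's rules: average the m estimates, and combine the within-imputation and between-imputation variance. The base-R sketch below shows the arithmetic; `pool_rubin` is a made-up name and the numbers are invented for illustration (in mice itself, the usual route is with() followed by pool()).

```r
# Rubin's rules for one parameter estimated on m imputed data sets.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  w <- mean(variances)              # within-imputation variance
  b <- var(estimates)               # between-imputation variance
  total <- w + (1 + 1 / m) * b      # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# Invented coefficients and standard errors from m = 5 imputed data sets
est <- c(1.9, 2.1, 2.0, 2.2, 1.8)
se  <- c(0.30, 0.28, 0.31, 0.29, 0.30)
pooled <- pool_rubin(est, se^2)
print(pooled)
```

The pooled standard error is larger than the average per-imputation standard error because it also carries the uncertainty due to the missing data.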

Detailed PDF: https://www.jstatsoft.org/article/view/v045i03