Is there an upper limit on the amount of missingness that is acceptable when using multiple imputation (MI)?
For example, can I use MI if the missing values in one variable amount to 20% of the cases, while the other variables also have missing values but not at such a high level?
One solution is to write your own custom imputation functions for the mice package. The package is prepared for this, and the setup is surprisingly pain-free.
First we set up the data as suggested:
dat <- data.frame(
  x1 = c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA, NA, NA, 50, 52, 26, 24),
  x2 = c(0, NA, 18, 0, 19, 0, NA, 0, 0, 0, 0, 0, 0, NA, NA, NA, 22, NA, 0, 0),
  x3 = c(0, 0, 0, 0, 0, 54, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0))
Next we load the mice package and see what methods it chooses by default:
library(mice)
# Do a dry run without any actual imputation
imp_base <- mice(dat, maxit = 0)
# Find the methods that mice chooses
imp_base$method
# Returns: "pmm" "pmm" "pmm"
# Look at the imputation matrix
imp_base$predictorMatrix
# Returns:
# x1 x2 x3
#x1 0 1 1
#x2 1 0 1
#x3 1 1 0
The pmm stands for predictive mean matching, probably the most popular imputation algorithm for continuous variables. It calculates a predicted value for the missing entry using a regression model and picks the 5 observed cases whose predicted values are closest to it. These cases are called the donor pool, and the final value is drawn at random from this pool.
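As a rough sketch (this is not mice's internal implementation; all names here are made up for illustration), the donor-pool idea looks like this:
# Toy illustration of the pmm idea (an assumed simplification):
# yhat_obs holds predictions for the observed cases, yhat_mis the
# prediction for one missing case, y_obs the observed outcomes
pmm_draw <- function(yhat_mis, yhat_obs, y_obs, donors = 5) {
  d <- abs(yhat_obs - yhat_mis)             # distance between predictions
  pool <- y_obs[order(d)[seq_len(donors)]]  # the `donors` closest cases
  sample(pool, 1)                           # draw the imputed value
}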
From the predictor matrix we can see that each method is passed the variables that matter for the restrictions. Note that the row is the target variable and the columns are the predictors. If x1 did not have a 1 in the x3 column we would have to add it to the matrix: imp_base$predictorMatrix["x1", "x3"] <- 1
Now to the fun part: generating the imputation methods. I've chosen a rather crude approach here, where I discard all the imputed values if any of them fails the criteria. This may result in long loop times; it would potentially be more efficient to keep the valid imputations and only redo the remaining ones, though that would require a little more tweaking.
# Generate our custom methods
mice.impute.pmm_x1 <- function (y, ry, x, donors = 5, type = 1,
                                ridge = 1e-05, version = "", ...) {
  # x1 must stay below the largest possible sum of x2 and x3
  max_sum <- max(x[, "x2"], na.rm = TRUE) +
    max(x[, "x3"], na.rm = TRUE)
  repeat {
    vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                            ridge = ridge, version = version, ...)
    if (all(vals < max_sum)) {
      break
    }
  }
  return(vals)
}

mice.impute.pmm_x2 <- function (y, ry, x, donors = 5, type = 1,
                                ridge = 1e-05, version = "", ...) {
  repeat {
    vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                            ridge = ridge, version = version, ...)
    # x2 is either 0 or at least 14
    if (all(vals == 0 | vals >= 14)) {
      break
    }
  }
  return(vals)
}

mice.impute.pmm_x3 <- function (y, ry, x, donors = 5, type = 1,
                                ridge = 1e-05, version = "", ...) {
  repeat {
    vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                            ridge = ridge, version = version, ...)
    # x3 is either 0 or at least 16
    if (all(vals == 0 | vals >= 16)) {
      break
    }
  }
  return(vals)
}
Once we are done defining the methods we simply swap in the new ones. If you only want to change a single variable you can use imp_base$method["x2"] <- "pmm_x2", but for this example we will change all of them (the naming is not strictly necessary):
imp_base$method <- c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3")
# The predictor matrix is not really necessary for this example
# but I use it just to illustrate in case you would like to
# modify it
imp_ds <-
mice(dat,
method = imp_base$method,
predictorMatrix = imp_base$predictorMatrix)
Now let's have a look at the third imputed dataset:
> complete(imp_ds, action = 3)
x1 x2 x3
1 21 0 0
2 50 19 0
3 31 18 0
4 15 0 0
5 36 19 0
6 82 0 54
7 14 0 0
8 14 0 0
9 19 0 0
10 18 0 0
11 16 0 0
12 36 0 0
13 583 0 0
14 50 22 0
15 52 19 0
16 14 0 0
17 50 22 0
18 52 0 0
19 26 0 0
20 24 0 0
Ok, that does the job. I like this solution as you can piggyback on top of mainstream functions and just add the restrictions that you find meaningful.
In order to enforce the stricter constraints @t0x1n mentioned in the comments, we may want to add the following abilities to the wrapper function:
- escape the check when mice:::remove.lindep has dropped the columns needed for it
- keep the valid imputed values between iterations and only redo the invalid ones
- gradually inflate the donor pool when no match is found
- an emergency exit so the loop cannot run forever
This results in a slightly more complicated wrapper function:
mice.impute.pmm_x1_adv <- function (y, ry, x, donors = 5, type = 1,
                                    ridge = 1e-05, version = "", ...) {
  # mice:::remove.lindep may remove the columns required for
  # the check - in those cases we should escape the check
  if (!all(c("x2", "x3") %in% colnames(x))) {
    warning("Could not enforce pmm_x1 due to missing column(s): ",
            c("x2", "x3")[!c("x2", "x3") %in% colnames(x)])
    return(mice.impute.pmm(y, ry, x, donors = donors, type = type,
                           ridge = ridge, version = version, ...))
  }
  # The upper limit for each missing element
  max_vals <- rowSums(x[!ry, c("x2", "x3")])
  # We keep saving the valid values in valid_vals between iterations
  valid_vals <- rep(NA, length.out = sum(!ry))
  # We need a counter in order to avoid an eternal loop
  # and for inflating the donor pool if no match is found
  cntr <- 0
  repeat {
    # We should be prepared to increase the donor pool, otherwise
    # the criteria may become impossible to meet
    donor_inflation <- floor(cntr / 10)
    vals <- mice.impute.pmm(y, ry, x,
                            donors = min(donors + donor_inflation, sum(ry)),
                            type = type, ridge = ridge,
                            version = version, ...)
    # Our criteria check
    correct <- vals < max_vals
    if (all(!is.na(valid_vals) | correct)) {
      valid_vals[correct] <- vals[correct]
      break
    } else if (any(is.na(valid_vals) & correct)) {
      # Save the new valid values
      valid_vals[correct] <- vals[correct]
    }
    # An emergency exit to avoid an endless loop
    cntr <- cntr + 1
    if (cntr > 200) {
      warning("Could not completely enforce constraints for ",
              sum(is.na(valid_vals)), " out of ",
              length(valid_vals), " missing elements")
      if (all(is.na(valid_vals))) {
        valid_vals <- vals
      } else {
        valid_vals[is.na(valid_vals)] <- vals[is.na(valid_vals)]
      }
      break
    }
  }
  return(valid_vals)
}
Note that this does not perform that well, most likely because the suggested data set fails the constraints for all of the complete cases. I need to increase the loop limit to 400-500 before it even starts to behave. I assume this is unintentional; your imputation should mimic how the actual data are generated.
The argument ry is a logical vector indicating the non-missing elements of y. We could possibly speed up the loop by removing the elements for which we have already found eligible imputations, but as I'm unfamiliar with the inner functions I have refrained from doing so.
I think the most important thing when you have strong constraints that take time to fulfill is to parallelize your imputations (see my answer on CrossValidated). Most computers today have 4-8 cores and R only uses one of them by default. The run time can be (almost) cut in half by doubling the number of cores.
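A minimal sketch of one common pattern (assuming a Unix-like system, since parallel::mclapply does not fork on Windows): run one chain per core with a different seed and merge the resulting mids objects with mice::ibind():
library(mice)
library(parallel)
# One single-chain imputation per core, each with its own seed
imp_list <- mclapply(1:4, function(seed) {
  mice(dat, m = 1, seed = seed, printFlag = FALSE)
}, mc.cores = 4)
# ibind() merges two mids objects; Reduce() folds the whole list
imp_par <- Reduce(ibind, imp_list)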
Regarding the problem of x2 being missing at the time of imputation: mice actually never feeds missing values into the x data.frame. The mice method starts by filling in a random value, and the chained part of the imputation limits the impact of this initial value. If you look at the mice function you can find this prior to the imputation call (in the mice:::sampler function):
...
if (method[j] != "") {
    for (i in 1:m) {
        if (nmis[j] < nrow(data)) {
            if (is.null(data.init)) {
                imp[[j]][, i] <- mice.impute.sample(y, ry, ...)
            } else {
                imp[[j]][, i] <- data.init[!ry, j]
            }
        } else {
            imp[[j]][, i] <- rnorm(nrow(data))
        }
    }
}
...
The data.init argument can be supplied to the mice function, and mice.impute.sample is a basic sampling procedure.
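As a quick illustration (a sketch under the assumption that you want deterministic starting values instead of random draws), you could pass a median-filled copy of the data as data.init:
# Start every chain from a column-median fill rather than a random
# draw from the observed values
init <- dat
for (j in seq_along(init)) {
  init[[j]][is.na(init[[j]])] <- median(init[[j]], na.rm = TRUE)
}
imp_init <- mice(dat, data.init = init, printFlag = FALSE)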
If the visiting sequence is important you can specify the order in which the mice function runs the imputations. The default is 1:ncol(data), but you can set visitSequence to anything you like.
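For example (a sketch; in recent versions of mice the sequence can be given by variable names):
# Impute x3 first, then x1, then x2 within each iteration
imp_vs <- mice(dat,
               visitSequence = c("x3", "x1", "x2"),
               printFlag = FALSE)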
It looks like you are interested in multiple imputation. See this link on ways to impute / handle categorical data. The link discusses the details and how to do this in SAS.
The R package mice can handle categorical data in univariate cases using logistic regression and discriminant function analysis (see the link). If you use SAS, proc mi is the way to go [see link].
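A minimal sketch (with made-up data) of how mice assigns these methods to factors:
library(mice)
# Made-up data: a two-level factor and a continuous covariate,
# both with a few missing values
set.seed(1)
df <- data.frame(g = factor(sample(c("A", "B"), 50, replace = TRUE)),
                 z = rnorm(50))
df$g[sample(50, 5)] <- NA
df$z[sample(50, 5)] <- NA
imp_cat <- mice(df, printFlag = FALSE)
imp_cat$method
# g gets "logreg" (logistic regression), z gets "pmm"; a factor with
# more than two levels would get "polyreg" by default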
Edit:
You can use the rfunsuper function from my answer to another question.
p <- 500
n <- 200
mat <- matrix(NA, ncol = p, nrow = n)
for (i in 1:p) {
  if (i == 1) {
    # First marker: genotypes drawn with 1:2:1 frequencies
    fs1 <- sample(c("AA", "AB", "AB", "BB"), n, replace = TRUE)
  }
  # Each marker copies the previous one with a single random
  # individual resampled
  rechr <- sample(1:n, 1)
  fs1[rechr] <- sample(c("AA", "AB", "AB", "BB"), 1)
  mat[, i] <- fs1
}
dim(mat)

# Knock out 20 random cells to create missing values
rowind <- sample(1:n, 20)
colind <- sample(1:p, 20)
for (i in 1:length(rowind)) {
  mat[rowind[i], colind[i]] <- NA
}
mat[1:20, 1:10]

out1 <- rfunsuper(data.frame(mat), iter = 5, ntree = 100)
diff.rel = 1.555556 / 50587.33 = 3.07499e-05
diff.rel = 0.1111111 / 50590.67 = 2.196277e-06
Call:
randomForest(x = x.roughfixed, ntree = ntree)
Type of random forest: unsupervised
Number of trees: 100
No. of variables tried at each split: 22
Best Answer
From the comments, you're confident that you're in a MAR or MCAR situation. Then multiple imputation is at least reasonable. So how much missingness is tractable? Think of it this way:
Basically, multiple imputation makes all your model parameter estimates less certain, as a function of how accurately the missing data can be predicted by your imputation model. This in turn depends, among other things, on the amount of missingness that needs imputing and on the number of imputations you use.
How much is 'too much' missingness therefore depends on how much added variance/uncertainty you are willing to put up with. A useful quantity for you might be the relative efficiency ($RE$) of an MI analysis. This depends on the 'fraction of missing information' (not the simple rate of missingness), usually called $\lambda$, and the number of imputations, usually called $m$, as $RE \approx 1/(1+\lambda/m)$.
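For instance, a quick calculation of $RE$ over a grid of $\lambda$ and $m$ values:
# Relative efficiency RE = 1 / (1 + lambda / m)
re <- function(lambda, m) 1 / (1 + lambda / m)
round(outer(c(0.2, 0.5, 0.8), c(5, 20, 100), re), 3)
#       [,1]  [,2]  [,3]
# [1,] 0.962 0.990 0.998
# [2,] 0.909 0.976 0.995
# [3,] 0.862 0.962 0.992
Even with $\lambda = 0.8$, twenty imputations recover about 96% efficiency, which is why the fraction of missing information matters more than the raw missingness rate.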
Rather than repeat the definitions of missing information etc. here, you might simply read the MI FAQ, which puts things very clearly. From there you'll know whether you want to tackle the original sources: Rubin etc.
Practically speaking you should probably just try an imputation analysis and see how it works out.