Is there an upper limit on the amount of missingness that is acceptable when using multiple imputation (MI)?
For example, can I use MI if the missing values in one variable amount to 20% of the cases, while the other variables also have missing values but not at such a high level?
One solution is to write your own custom imputation functions for the mice package. The package is prepared for this, and the setup is surprisingly pain-free.
First we set up the data as suggested:
dat <- data.frame(
  x1 = c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA, NA, NA, 50, 52, 26, 24),
  x2 = c(0, NA, 18, 0, 19, 0, NA, 0, 0, 0, 0, 0, 0, NA, NA, NA, 22, NA, 0, 0),
  x3 = c(0, 0, 0, 0, 0, 54, 0, 0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0))
Next we load the mice package and see what methods it chooses by default:
library(mice)
# Do a dry run without any actual imputation
imp_base <- mice(dat, maxit = 0)
# Find the methods that mice chooses
imp_base$method
# Returns: "pmm" "pmm" "pmm"
# Look at the imputation matrix
imp_base$predictorMatrix
# Returns:
# x1 x2 x3
#x1 0 1 1
#x2 1 0 1
#x3 1 1 0
The pmm stands for predictive mean matching, probably the most popular imputation algorithm for continuous variables. It calculates a predicted value for the missing entry using a regression model and picks the 5 observed cases whose predicted values are closest to it. These cases are called the donor pool, and the final value is drawn at random from this pool.
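As a rough sketch (this is not mice's internal implementation; all names here are made up for illustration), the donor-pool idea looks like this:
# Toy illustration of the pmm idea (an assumed simplification):
# yhat_obs holds predictions for the observed cases, yhat_mis the
# prediction for one missing case, y_obs the observed outcomes
pmm_draw <- function(yhat_mis, yhat_obs, y_obs, donors = 5) {
  d <- abs(yhat_obs - yhat_mis)             # distance between predictions
  pool <- y_obs[order(d)[seq_len(donors)]]  # the `donors` closest cases
  sample(pool, 1)                           # draw the imputed value
}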
From the predictor matrix we can see that each method is passed the variables that matter for the restrictions. Note that the row is the target variable and the columns are the predictors. If x1 did not have a 1 in the x3 column we would have to add it to the matrix: imp_base$predictorMatrix["x1", "x3"] <- 1
Now to the fun part: generating the imputation methods. I've chosen a rather crude approach here, where I discard all the imputed values if any of them fails the criteria. This may result in long loop times; it would potentially be more efficient to keep the valid imputations and only redo the remaining ones, though that would require a little more tweaking.
# Generate our custom methods
mice.impute.pmm_x1 <- function (y, ry, x, donors = 5, type = 1,
                                ridge = 1e-05, version = "", ...) {
  # x1 must stay below the largest possible sum of x2 and x3
  max_sum <- max(x[, "x2"], na.rm = TRUE) +
    max(x[, "x3"], na.rm = TRUE)
  repeat {
    vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                            ridge = ridge, version = version, ...)
    if (all(vals < max_sum)) {
      break
    }
  }
  return(vals)
}

mice.impute.pmm_x2 <- function (y, ry, x, donors = 5, type = 1,
                                ridge = 1e-05, version = "", ...) {
  repeat {
    vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                            ridge = ridge, version = version, ...)
    # x2 is either 0 or at least 14
    if (all(vals == 0 | vals >= 14)) {
      break
    }
  }
  return(vals)
}

mice.impute.pmm_x3 <- function (y, ry, x, donors = 5, type = 1,
                                ridge = 1e-05, version = "", ...) {
  repeat {
    vals <- mice.impute.pmm(y, ry, x, donors = donors, type = type,
                            ridge = ridge, version = version, ...)
    # x3 is either 0 or at least 16
    if (all(vals == 0 | vals >= 16)) {
      break
    }
  }
  return(vals)
}
Once we are done defining the methods we simply swap in the new ones. If you only want to change a single variable you can use imp_base$method["x2"] <- "pmm_x2", but for this example we will change all of them (the naming is not strictly necessary):
imp_base$method <- c(x1 = "pmm_x1", x2 = "pmm_x2", x3 = "pmm_x3")
# The predictor matrix is not really necessary for this example
# but I use it just to illustrate in case you would like to
# modify it
imp_ds <-
mice(dat,
method = imp_base$method,
predictorMatrix = imp_base$predictorMatrix)
Now let's have a look at the third imputed dataset:
> complete(imp_ds, action = 3)
x1 x2 x3
1 21 0 0
2 50 19 0
3 31 18 0
4 15 0 0
5 36 19 0
6 82 0 54
7 14 0 0
8 14 0 0
9 19 0 0
10 18 0 0
11 16 0 0
12 36 0 0
13 583 0 0
14 50 22 0
15 52 19 0
16 14 0 0
17 50 22 0
18 52 0 0
19 26 0 0
20 24 0 0
Ok, that does the job. I like this solution as you can piggyback on top of mainstream functions and just add the restrictions that you find meaningful.
In order to enforce the stricter constraints @t0x1n mentioned in the comments, we may want to add the following abilities to the wrapper function:
- escape the check when mice:::remove.lindep has dropped the columns needed for it
- keep the valid imputed values between iterations and only redo the invalid ones
- gradually inflate the donor pool when no match is found
- an emergency exit so the loop cannot run forever
This results in a slightly more complicated wrapper function:
mice.impute.pmm_x1_adv <- function (y, ry, x, donors = 5, type = 1,
                                    ridge = 1e-05, version = "", ...) {
  # mice:::remove.lindep may remove the columns required for
  # the check - in those cases we should escape the check
  if (!all(c("x2", "x3") %in% colnames(x))) {
    warning("Could not enforce pmm_x1 due to missing column(s): ",
            c("x2", "x3")[!c("x2", "x3") %in% colnames(x)])
    return(mice.impute.pmm(y, ry, x, donors = donors, type = type,
                           ridge = ridge, version = version, ...))
  }
  # The upper limit for each missing element
  max_vals <- rowSums(x[!ry, c("x2", "x3")])
  # We keep saving the valid values in valid_vals between iterations
  valid_vals <- rep(NA, length.out = sum(!ry))
  # We need a counter in order to avoid an eternal loop
  # and for inflating the donor pool if no match is found
  cntr <- 0
  repeat {
    # We should be prepared to increase the donor pool, otherwise
    # the criteria may become impossible to meet
    donor_inflation <- floor(cntr / 10)
    vals <- mice.impute.pmm(y, ry, x,
                            donors = min(donors + donor_inflation, sum(ry)),
                            type = type, ridge = ridge,
                            version = version, ...)
    # Our criteria check
    correct <- vals < max_vals
    if (all(!is.na(valid_vals) | correct)) {
      valid_vals[correct] <- vals[correct]
      break
    } else if (any(is.na(valid_vals) & correct)) {
      # Save the new valid values
      valid_vals[correct] <- vals[correct]
    }
    # An emergency exit to avoid an endless loop
    cntr <- cntr + 1
    if (cntr > 200) {
      warning("Could not completely enforce constraints for ",
              sum(is.na(valid_vals)), " out of ",
              length(valid_vals), " missing elements")
      if (all(is.na(valid_vals))) {
        valid_vals <- vals
      } else {
        valid_vals[is.na(valid_vals)] <- vals[is.na(valid_vals)]
      }
      break
    }
  }
  return(valid_vals)
}
Note that this does not perform that well, most likely because the suggested data set fails the constraints for all of the complete cases. I need to increase the loop limit to 400-500 before it even starts to behave. I assume this is unintentional; your imputation should mimic how the actual data are generated.
The argument ry is a logical vector indicating the non-missing elements of y. We could possibly speed up the loop by removing the elements for which we have already found eligible imputations, but as I'm unfamiliar with the inner functions I have refrained from doing so.
I think the most important thing when you have strong constraints that take time to fulfill is to parallelize your imputations (see my answer on CrossValidated). Most computers today have 4-8 cores and R only uses one of them by default. The run time can be (almost) cut in half by doubling the number of cores.
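A minimal sketch of one common pattern (assuming a Unix-like system, since parallel::mclapply does not fork on Windows): run one chain per core with a different seed and merge the resulting mids objects with mice::ibind():
library(mice)
library(parallel)
# One single-chain imputation per core, each with its own seed
imp_list <- mclapply(1:4, function(seed) {
  mice(dat, m = 1, seed = seed, printFlag = FALSE)
}, mc.cores = 4)
# ibind() merges two mids objects; Reduce() folds the whole list
imp_par <- Reduce(ibind, imp_list)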
Regarding the problem of x2 being missing at the time of imputation: mice actually never feeds missing values into the x data.frame. The mice method starts by filling in a random value, and the chained part of the imputation limits the impact of this initial value. If you look at the mice function you can find this prior to the imputation call (in the mice:::sampler function):
...
if (method[j] != "") {
    for (i in 1:m) {
        if (nmis[j] < nrow(data)) {
            if (is.null(data.init)) {
                imp[[j]][, i] <- mice.impute.sample(y, ry, ...)
            } else {
                imp[[j]][, i] <- data.init[!ry, j]
            }
        } else {
            imp[[j]][, i] <- rnorm(nrow(data))
        }
    }
}
...
The data.init argument can be supplied to the mice function, and mice.impute.sample is a basic sampling procedure.
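As a quick illustration (a sketch under the assumption that you want deterministic starting values instead of random draws), you could pass a median-filled copy of the data as data.init:
# Start every chain from a column-median fill rather than a random
# draw from the observed values
init <- dat
for (j in seq_along(init)) {
  init[[j]][is.na(init[[j]])] <- median(init[[j]], na.rm = TRUE)
}
imp_init <- mice(dat, data.init = init, printFlag = FALSE)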
If the visiting sequence is important you can specify the order in which the mice function runs the imputations. The default is 1:ncol(data), but you can set visitSequence to anything you like.
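For example (a sketch; in recent versions of mice the sequence can be given by variable names):
# Impute x3 first, then x1, then x2 within each iteration
imp_vs <- mice(dat,
               visitSequence = c("x3", "x1", "x2"),
               printFlag = FALSE)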
It looks like you are interested in multiple imputation. See this link on ways to impute / handle categorical data. The link discusses the details and how to do this in SAS.
The R package mice can handle categorical data in univariate cases using logistic regression and discriminant function analysis (see the link). If you use SAS, proc mi is the way to go [see link].
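A minimal sketch (with made-up data) of how mice assigns these methods to factors:
library(mice)
# Made-up data: a two-level factor and a continuous covariate,
# both with a few missing values
set.seed(1)
df <- data.frame(g = factor(sample(c("A", "B"), 50, replace = TRUE)),
                 z = rnorm(50))
df$g[sample(50, 5)] <- NA
df$z[sample(50, 5)] <- NA
imp_cat <- mice(df, printFlag = FALSE)
imp_cat$method
# g gets "logreg" (logistic regression), z gets "pmm"; a factor with
# more than two levels would get "polyreg" by default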
Edit:
You can use the rfunsuper function from my answer to another question.
p <- 500
n <- 200
mat <- matrix(NA, ncol = p, nrow = n)
for (i in 1:p) {
  if (i == 1) {
    # First marker: genotypes drawn with 1:2:1 frequencies
    fs1 <- sample(c("AA", "AB", "AB", "BB"), n, replace = TRUE)
  }
  # Each marker copies the previous one with a single random
  # individual resampled
  rechr <- sample(1:n, 1)
  fs1[rechr] <- sample(c("AA", "AB", "AB", "BB"), 1)
  mat[, i] <- fs1
}
dim(mat)

# Knock out 20 random cells to create missing values
rowind <- sample(1:n, 20)
colind <- sample(1:p, 20)
for (i in 1:length(rowind)) {
  mat[rowind[i], colind[i]] <- NA
}
mat[1:20, 1:10]

out1 <- rfunsuper(data.frame(mat), iter = 5, ntree = 100)
diff.rel = 1.555556 / 50587.33 = 3.07499e-05
diff.rel = 0.1111111 / 50590.67 = 2.196277e-06
Call:
randomForest(x = x.roughfixed, ntree = ntree)
Type of random forest: unsupervised
Number of trees: 100
No. of variables tried at each split: 22
Best Answer
From the comments, you're confident that you're in a MAR or MCAR situation. Then multiple imputation is at least reasonable. So how much missingness is tractable? Think of it this way:
Basically, multiple imputation makes all your model parameter estimates less certain, as a function of how accurately the missing data can be predicted by your imputation model. This in turn depends, among other things, on the amount of missingness that needs imputing and on the number of imputations you use.
How much is 'too much' missingness therefore depends on how much added variance/uncertainty you are willing to put up with. A useful quantity for you might be the relative efficiency ($RE$) of an MI analysis. This depends on the 'fraction of missing information' (not the simple rate of missingness), usually called $\lambda$, and the number of imputations, usually called $m$, as $RE \approx 1/(1+\lambda/m)$.
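For instance, a quick calculation of $RE$ over a grid of $\lambda$ and $m$ values:
# Relative efficiency RE = 1 / (1 + lambda / m)
re <- function(lambda, m) 1 / (1 + lambda / m)
round(outer(c(0.2, 0.5, 0.8), c(5, 20, 100), re), 3)
#       [,1]  [,2]  [,3]
# [1,] 0.962 0.990 0.998
# [2,] 0.909 0.976 0.995
# [3,] 0.862 0.962 0.992
Even with $\lambda = 0.8$, twenty imputations recover about 96% efficiency, which is why the fraction of missing information matters more than the raw missingness rate.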
Rather than repeat the definitions of missing information etc. here, you might simply read the MI FAQ, which puts things very clearly. From there you'll know whether you want to tackle the original sources: Rubin etc.
Practically speaking you should probably just try an imputation analysis and see how it works out.