I would like to use imputation for replacing missing values in my data set under certain constraints.
For example, I'd like the imputed variable x1
to be greater or equal to the sum of my two other variables, say x2
and x3
. I also want x3
to be imputed by either 0
or >= 14
and I want x2
to be imputed by either 0
or >= 16
.
I tried defining these constraints in SPSS for multiple imputation, but in SPSS I can only define maximum and minimum values. Is there any way to define further constraints in SPSS or do you know any R package which would let me define such constraints for imputation of missing values?
My data is as follows:
x1 =c(21, 50, 31, 15, 36, 82, 14, 14, 19, 18, 16, 36, 583, NA,NA,NA, 50, 52, 26, 24)
x2 = c(0, NA, 18,0, 19, 0, NA, 0, 0, 0, 0, 0, 0,NA,NA, NA, 22, NA, 0, 0)
x3 = c(0, 0, 0, 0, 0, 54, 0 ,0, 0, 0, 0, 0, 0, NA, NA, NA, NA, 0, 0, 0)
dat=data.frame(x1=x1, x2=x2, x3=x3)
> dat
x1 x2 x3
1 21 0 0
2 50 NA 0
3 31 18 0
4 15 0 0
5 36 19 0
6 82 0 54
7 14 NA 0
8 14 0 0
9 19 0 0
10 18 0 0
11 16 0 0
12 36 0 0
13 583 0 0
14 NA NA NA
15 NA NA NA
16 NA NA NA
17 50 22 NA
18 52 NA 0
19 26 0 0
20 24 0 0
Best Answer
One solution is to write your own custom imputation functions for the
mice
package. The package is prepared for this and the setup surprisingly pain-free.First we setup the data as suggested:
Next we load the
mice
package and see what methods it choose by default:The
pmm
stands for predictive mean matching - probably the most popular imputation algorithm for imputing continuous variables. It calculates the predicted value using a regression model and picks the 5 closest elements to the predicted value (by Euclidean distance). These chosen elements are called the donor pool and the final value is chosen at random from this donor pool.From the prediction matrix we find that the methods get the variables passed that are of interest for the restrictions. Note that the row is the target variable and the column the predictors. If x1 did not have 1 in the x3 column we would have to add this in the matrix:
imp_base$predictorMatrix["x1","x3"] <- 1
Now to the fun part, generating the imputation methods. I've chosen a rather crude method here where I discard all values if they don't meet the criteria. This may result in long loop time and it may potentially be more efficient to keep the valid imputations and only redo the remaining ones, it would require a little more tweaking though.
Once we are done defining the methods we simple change the previous methods. If you only want to change a single variable then you can simply use
imp_base$method["x2"] <- "pmm_x2"
but for this example we will change all (the naming is not necessary):Now let's have a look at the third imputed dataset:
Ok, that does the job. I like this solution as you can piggyback on top of mainstream functions and just add the restrictions that you find meaningful.
Update
In order to enforce the rigorous restraints @t0x1n mentioned in the comments, we may want to add the following abilities to the wrapper function:
This results in a slightly more complicated wrapper function:
Note that this does not perform that well, most likely due to that the suggested data set fails the constraints for all cases without missing. I need to increase the loop length to 400-500 before it even starts to behave. I assume that this is unintentional, your imputation should mimic how the actual data is generated.
Optimization
The argument
ry
contains the non-missing values and we could possibly speed up the loop by removing the elements that we have found eligible imputations, but as I'm unfamiliar with the inner functions I have refrained from this.I think the most important thing when you have strong constraints that take time to full-fill is to parallelize your imputations (see my answer on CrossValidated). Most have today computers with 4-8 cores and R only uses one of them by default. The time can be (almost) sliced in half by doubling the number of cores.
Missing parameters at imputation
Regarding the problem of
x2
being missing at the time of imputation - mice actually never feeds missing values into thex
-data.frame
. The mice method includes filling in some random value at start. The chain-part of the imputation limits the impact from this initial value. If you look at themice
-function you can find this prior to the imputation call (themice:::sampler
-function):The
data.init
can be supplied to themice
function and the mice.imput.sample is a basic sampling procedure.Visiting sequence
If visiting sequence is important you can specify the order in which the
mice
-function runs the imputations. Default is from1:ncol(data)
but you can set thevisitSequence
to be anything you like.