How to improve running time for R MICE data imputation

mice, multiple-imputation, r

My question in short: are there ways to improve the running time of mice (data imputation) in R?

I'm working with a data set (30 variables, 1.3 million rows) that contains missing data scattered more or less at random. About 15 of the 30 variables have NAs in roughly 8% of their observations. To impute the missing data, I'm running the mice() function from the mice package.

The running time is quite slow: even on a subset (100,000 rows) with method="fastpmm" and m=1, it runs for about 15 minutes.

Is there a way to improve the running time without losing too much in performance? (mice.impute.mean is quite fast, but comes with a substantial loss of information!)

Reproducible code:

library(mice)

# 30 columns, each sampled from NA and 1:10, so roughly 9% NA per column
df <- data.frame(replicate(30, sample(c(NA, 1:10), 1000000, replace = TRUE)))
df <- data.frame(scale(df))

output <- mice(df, m = 1, method = "fastpmm")

Best Answer

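You can use quickpred() from the mice package to limit the number of predictors used for each variable. It selects predictors based on mincor (the minimum absolute correlation) and minpuc (the minimum proportion of usable cases). You can also use the include and exclude arguments to force specific predictors in or out. The matrix it returns is passed to mice() through the predictorMatrix argument.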
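As a rough illustration of this approach, here is a minimal sketch on a small toy data set. The data, the variable names (f1, f2, x1 to x6) and the thresholds mincor = 0.3 and minpuc = 0.5 are assumptions made up for the example, not values from the question; sensible thresholds depend on your data. The sketch uses method = "pmm" instead of "fastpmm", since "fastpmm" may not be available in every version of mice.

```r
library(mice)

set.seed(1)
n <- 2000

# Hypothetical toy data: x1-x3 share one latent factor, x4-x6 share another,
# so variables correlate strongly within a group and hardly at all across groups.
f1 <- rnorm(n)
f2 <- rnorm(n)
df <- data.frame(
  x1 = f1 + rnorm(n, sd = 0.5),
  x2 = f1 + rnorm(n, sd = 0.5),
  x3 = f1 + rnorm(n, sd = 0.5),
  x4 = f2 + rnorm(n, sd = 0.5),
  x5 = f2 + rnorm(n, sd = 0.5),
  x6 = f2 + rnorm(n, sd = 0.5)
)

# Set 8% of the values in every column to NA, completely at random.
for (v in names(df)) {
  df[sample(n, 0.08 * n), v] <- NA
}

# Build a reduced predictor matrix: a predictor is kept for a target only if
# its absolute correlation reaches mincor and enough usable cases exist (minpuc).
pred <- quickpred(df, mincor = 0.3, minpuc = 0.5)

# Number of predictors kept per target variable (rows are targets).
print(rowSums(pred))

# Run the imputation with the reduced predictor matrix.
imp <- mice(df, m = 1, method = "pmm", predictorMatrix = pred, printFlag = FALSE)
completed <- mice::complete(imp, 1)
```

Each imputation model then only uses the predictors that quickpred() kept, which is where the time saving would come from on a wide data set. Note that on data like the reproducible example in the question, where all columns are sampled independently, almost no predictor would pass a correlation threshold, so it is worth inspecting rowSums(pred) before running mice().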
