Solved – How to improve running time for R MICE data imputation

mice, multiple-imputation, r

My question in short: are there methods to improve on the running time of R MICE (data imputation)?

I'm working with a data set (30 variables, 1.3 million rows) which contains (quite randomly) missing data. About 8% of the observations in about 15 out of 30 variables contain NAs. In order to impute the missing data, I'm running the mice() function from the mice package.

The running time is quite slow: even on a subset (100,000 rows) with method="fastpmm" and m=1, it runs for about 15 minutes.

Is there a way to improve the running time without losing too much imputation quality? (mice.impute.mean is quite fast, but comes with an important loss of information.)

Reproducible code:

library(mice)
df <- data.frame(replicate(30, sample(c(NA, 1:10), 1000000, replace = TRUE)))
df <- data.frame(scale(df))

output <- mice(df, m=1, method = "fastpmm")

Best Answer

You can use quickpred() from the mice package to limit the predictors used for each variable. Set mincor (minimum correlation) and minpuc (minimum proportion of usable cases) to control which predictors are kept, and use the include and exclude parameters to force specific variables in or out. Pass the resulting matrix to mice() through its predictorMatrix argument.
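As a rough illustration of this answer, the sketch below builds a reduced predictor matrix with quickpred() and passes it to mice(). The data set, the thresholds (mincor = 0.3, minpuc = 0.5), and the use of method = "pmm" are assumptions chosen for the example, not values from the question; "pmm" stands in for "fastpmm", which may not be available in current versions of mice.

```r
library(mice)

set.seed(1)

# Small synthetic data set, for illustration only (not the asker's data).
n  <- 2000
x1 <- rnorm(n)
x3 <- rnorm(n)
df <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.5),  # strongly related to x1
  x3 = x3,
  x4 = x3 + rnorm(n)             # related to x3
)

# Set roughly 8% of each variable to NA, completely at random.
for (v in names(df)) {
  df[sample(n, 0.08 * n), v] <- NA
}

# Keep only predictors with absolute correlation >= 0.3 and at least
# 50% usable cases. Both thresholds are assumptions for this sketch.
pred <- quickpred(df, mincor = 0.3, minpuc = 0.5)

# Fewer predictors per variable means smaller imputation models.
imp <- mice(df, m = 1, method = "pmm", predictorMatrix = pred,
            maxit = 5, printFlag = FALSE, seed = 1)
completed <- complete(imp)
```

Inspecting pred before running mice() shows how many predictors each variable keeps; raising mincor trims the models further, at the cost of discarding weakly related predictors.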

Related Solutions

Solved – Multiple Imputation and “Conditional Missing” values

I suggest imputing in two steps: impute Condition first, then impute Dependent1-3 for the subset where Condition = Yes.

# I extended your data a bit, to get more cases.
TestData <- data.frame(Condition = c(1,1,1,1,2,NA,2,2, sample(1:2, 500, replace = TRUE)),
                   Dependent1 = c(1,NA,2,3,NA,NA,NA,NA, sample(1:3, 500, replace = TRUE)),
                   Dependent2 = c(1,12,44,1,NA,NA,NA,NA, sample(1:44, 500, replace = TRUE)),
                   Dependent3 = c(NA,2,3,5,NA,NA,NA,NA, sample(1:5, 500, replace = TRUE)),
                   UnaffiliatedQ = c(1,NA,3,2,27,NA,32,35, sample(1:35, 500, replace = TRUE)))
TestData[10:25, 1:4] <- NA

TestData$Condition <- factor(TestData$Condition,
                         levels = c(1,2),
                         labels = c("Yes","No"))



library("mice")

# Step 1: Imputation of condition
imp <- mice(TestData[ , c(1, 5)], m = 1)
data_imp <- complete(imp)

TestData$Condition <- as.factor(as.character(data_imp$Condition))
TestData$UnaffiliatedQ <- as.numeric(data_imp$UnaffiliatedQ)

# Subset of TestData (Condition == Yes)
TestData_sub <- TestData[TestData$Condition == "Yes", ]

# Step 2: Imputation of Dependent1-3
imp_2 <- mice(TestData_sub, m = 1)
data_imp_2 <- complete(imp_2)

TestData$Dependent1[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent1)
TestData$Dependent2[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent2)
TestData$Dependent3[TestData$Condition == "Yes"] <- as.numeric(data_imp_2$Dependent3)

Solved – Multiple imputation for missing data in longitudinal study

Multiple imputation is an appropriate approach for your situation, but you need to account for the multilevel nature of your data. The observations are nested within participants, and the imputation model has to reflect that. So you will need to select a multilevel imputation method. The mice package offers several such methods, and their names all begin with "2l." (e.g., 2l.norm).

You can read more about the approach in this article: "Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation" http://doi.org/10.1037/met0000063 and you can read more about implementing it in mice in this article: "Multivariate imputation by chained equations" http://doc.utwente.nl/78938/1/Buuren11mice.pdf
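A minimal sketch of how such a multilevel method could be set up in mice, assuming a long-format data set with one row per participant and time point. The variable names (id, time, y), the simulated data, and the dropout pattern are hypothetical. In the predictor matrix row of the imputed variable, -2 marks the cluster variable and 1 marks a fixed-effect predictor; the cluster variable must be an integer.

```r
library(mice)

set.seed(1)

# Hypothetical longitudinal data: 30 participants, 5 time points each.
n_id <- 30
n_t  <- 5
dat  <- data.frame(
  id   = rep(1:n_id, each = n_t),      # integer cluster identifier
  time = rep(0:(n_t - 1), times = n_id)
)
u <- rnorm(n_id)                        # participant-level random intercepts
dat$y <- 1 + 0.5 * dat$time + u[dat$id] + rnorm(nrow(dat))

# Simulated dropout: the first 10 participants miss the last measurement.
dat$y[dat$time == n_t - 1 & dat$id %in% 1:10] <- NA

# Start from the defaults, then switch y to a two-level method.
meth <- make.method(dat)
meth["y"] <- "2l.norm"

pred <- make.predictorMatrix(dat)
pred["y", c("id", "time")] <- c(-2, 1)  # -2 = cluster variable, 1 = fixed effect

imp <- mice(dat, m = 1, method = meth, predictorMatrix = pred,
            maxit = 5, printFlag = FALSE, seed = 1)
completed <- complete(imp)
```

For a real analysis you would use m greater than 1 and check convergence; m = 1 only keeps the sketch short.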
