Solved – Missing values for multiple columns

data-imputation, machine learning, missing data

I have recently been working on missing value imputation. The dataset I am using is the Mammographic Mass data set found here. It contains missing values in multiple columns. I need some ideas on how to build a model, or which technique to use, to impute the missing values.

Best Answer

A common approach is Multivariate Imputation by Chained Equations (MICE). A paper about the topic can be found here.
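To make the idea of "chained equations" concrete, here is a minimal base R sketch of the cycle on made-up data. It is a simplification under stated assumptions, not what the mice package does internally: the toy columns a and b, the mean-value starting fill, the linear regression models, and the 10 iterations are all illustrative choices, and a proper implementation would also account for uncertainty in the regression coefficients.

```r
# Toy data with missing values in two correlated columns (made up for illustration)
set.seed(1)
n <- 200
a <- rnorm(n)
b <- a + rnorm(n)
d <- data.frame(a, b)
d$a[sample(n, 20)] <- NA
d$b[sample(n, 30)] <- NA

# Remember which cells were originally missing
miss <- is.na(d)

# Step 1: fill every missing cell with a placeholder (here, the column mean)
filled <- d
for (v in names(d)) {
  filled[miss[, v], v] <- mean(d[[v]], na.rm = TRUE)
}

# Step 2: cycle through the columns several times
for (iter in 1:10) {
  for (v in names(d)) {
    if (!any(miss[, v])) next
    # Regress v on all other columns, using rows where v was observed
    fit <- lm(reformulate(setdiff(names(d), v), response = v),
              data = filled[!miss[, v], ])
    # Replace the originally missing cells with prediction plus residual noise
    pred <- predict(fit, newdata = filled[miss[, v], , drop = FALSE])
    filled[miss[, v], v] <- pred + rnorm(length(pred), sd = sigma(fit))
  }
}
```

Each column is imputed from a model that uses the current filled-in values of the other columns, which is why the procedure can handle missing values in several columns at once.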

Several statistical software packages can perform MICE. Below is an example in R, in which I use the mice package to impute some example data.

# Example data (the seed makes the simulated values reproducible)
set.seed(123)
N <- 1000
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x3 <- rnorm(N)
x4 <- x2 + x3 + rnorm(N)
x5 <- rnorm(N)

# Insert missings
x1[rbinom(N, 1, 0.1) == 1] <- NA
x2[rbinom(N, 1, 0.2) == 1] <- NA
x3[rbinom(N, 1, 0.05) == 1] <- NA
x4[rbinom(N, 1, 0.1) == 1] <- NA
x5[rbinom(N, 1, 0.3) == 1] <- NA

# Data with missings
data <- data.frame(x1, x2, x3, x4, x5)

# Imputation
library("mice")
# m = 1 requests a single imputation; the default is m = 5 (multiple imputation).
# The imputation method can be set with 'method ='; the default for numeric
# columns is "pmm" (predictive mean matching).
# The predictors used for each column can be set with 'predictorMatrix ='.
imp <- mice(data, m = 1)

# Completed data
data_imp <- complete(imp)
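
If the imputed data are used for inference rather than just to obtain one filled-in table, the usual route is multiple imputation: impute several times, fit the model on each completed data set, and pool the results. The sketch below shows that workflow with mice on a small simulated example. The names dat, imp5, fit and pooled, the model x2 ~ x1, and the seed are assumptions chosen for illustration.

```r
library(mice)

# Small simulated data set with missing values in x2 (illustrative only)
set.seed(123)
N <- 500
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x2[rbinom(N, 1, 0.2) == 1] <- NA
dat <- data.frame(x1, x2)

# Create 5 imputed data sets; printFlag = FALSE suppresses the iteration log
imp5 <- mice(dat, m = 5, seed = 123, printFlag = FALSE)

# Fit the same regression on each of the 5 completed data sets
fit <- with(imp5, lm(x2 ~ x1))

# Combine the 5 sets of estimates using Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

The pooled standard errors reflect both the sampling variability and the extra uncertainty caused by the missing values, which a single imputation (m = 1) cannot capture.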