Solved – Missing values for multiple columns

data-imputation, machine learning, missing data

I have recently started working on missing value imputation. The dataset I am using is the Mammographic Mass data set, found here. The dataset contains missing values in multiple columns. I need some ideas on how to build a model, or which technique to use, to impute the missing values.

Best Answer

A common approach is Multivariate Imputation by Chained Equations (MICE). A paper about the topic can be found here.

Several statistical software packages can perform MICE. Below is an example in R, in which I used the package mice to impute some example data.

# Example data
N <- 1000
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x3 <- rnorm(N)
x4 <- x2 + x3 + rnorm(N)
x5 <- rnorm(N)

# Insert missings
x1[rbinom(N, 1, 0.1) == 1] <- NA
x2[rbinom(N, 1, 0.2) == 1] <- NA
x3[rbinom(N, 1, 0.05) == 1] <- NA
x4[rbinom(N, 1, 0.1) == 1] <- NA
x5[rbinom(N, 1, 0.3) == 1] <- NA

# Data with missings
data <- data.frame(x1, x2, x3, x4, x5)

# Imputation
library("mice")
imp <- mice(data, m = 1)
# m = 1 requests a single imputed data set; the default is m = 5 (multiple imputation)
# The imputation method can be set with 'method = '; the default for numeric data is pmm
# The predictor matrix can be set with 'predictorMatrix = '

# Completed data
data_imp <- complete(imp)
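
To make the idea behind chained equations more concrete, here is a minimal base R sketch of the cycle that MICE is built on: fill the gaps crudely, then repeatedly re-impute each incomplete column from a regression on all the other columns. This is a simplification and not what the mice package does internally. It uses plain linear regression plus residual noise instead of pmm, ignores parameter uncertainty, and assumes every column is numeric. The function name chained_impute, the argument n_iter and the toy data are made up for illustration.

```r
# Simplified chained-equations imputation in base R (illustration only).
# Assumes all columns are numeric; `chained_impute` is a hypothetical helper.
chained_impute <- function(data, n_iter = 10) {
  miss <- is.na(data)                        # remember which cells were missing
  filled <- data
  # Step 1: crude starting values (mean of the observed values in each column)
  for (j in seq_along(filled)) {
    filled[[j]][miss[, j]] <- mean(data[[j]], na.rm = TRUE)
  }
  # Step 2: cycle through the incomplete columns, re-imputing each from the others
  for (it in seq_len(n_iter)) {
    for (j in which(colSums(miss) > 0)) {
      target <- names(filled)[j]
      obs <- !miss[, j]
      # Fit on rows where the target was observed, using all other columns
      fit <- lm(as.formula(paste(target, "~ .")),
                data = filled[obs, , drop = FALSE])
      pred <- predict(fit, newdata = filled[!obs, , drop = FALSE])
      # Add residual noise so imputed values are not artificially precise
      filled[[j]][!obs] <- pred + rnorm(sum(!obs), sd = summary(fit)$sigma)
    }
  }
  filled
}

# Toy data (made up for this sketch)
set.seed(1)
n <- 200
t1 <- rnorm(n)
t2 <- t1 + rnorm(n)
t3 <- t2 + rnorm(n)
t2[sample(n, 30)] <- NA
t3[sample(n, 20)] <- NA
toy <- data.frame(t1, t2, t3)

toy_imp <- chained_impute(toy)
```

Observed values are left untouched and only the originally missing cells are overwritten. For real analyses the mice package remains the better choice, since it also propagates the uncertainty of the imputation model.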

Related Solutions

Solved – How to impute an ordinal variable with MICE but prevent it from taking one value

The following code defines and calls a dedicated imputation function that separates imputation of cases with tumor_size == 0 from tumor_size > 0.

## How to impute an ordinal variable with MICE but prevent it from taking one value?

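# stringsAsFactors = TRUE keeps overall_tumor_grade a factor
# (since R 4.0 the default is FALSE, which would leave it as character)
df <- data.frame(age = c(24,37,58,65,70,84),
                 overall_tumor_grade = c(1,1,2,3,'X',NA),
                 tumor_size = c(1.5,2.0,4.2,5.6,0,0.1),
                 stringsAsFactors = TRUE)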

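mice.impute.tumor <- function(y, ry, x, wy = NULL, ...){
    # wy is accepted but not forwarded: newer mice versions may pass it,
    # and it would refer to the full data rather than the subset used below
    ymis <- y[!ry]
    # x[, "tumor_size"] works whether x arrives as a data frame or a matrix
    tmis <- x[!ry, "tumor_size"] > 0
    t <- x[, "tumor_size"] > 0
    y[!ry] <- NA
    ymis[!tmis] <- "X"   # tumor_size == 0: the grade is always 'X'
    # tumor_size > 0: impute from those cases only, with the unused level 'X' dropped
    ymis[tmis] <- mice.impute.polyreg(y[t, drop = TRUE], ry[t], x[t, , drop = FALSE], ...)
    ymis
}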

ini <- mice(df, maxit = 0)               # dry run to obtain the default settings
meth <- ini$method
meth["overall_tumor_grade"] <- "tumor"   # use mice.impute.tumor for this column
imp <- mice(df, method = meth, maxit = 1, m = 2)

Solved – K-nearest neighbour imputation of missing values

Should I search the nearest data points (rows) or the nearest columns?

Search for the nearest data points, i.e. rows, and impute the missing value in feature j using the jth feature of the nearest neighbours. I don't know why Matlab's knnimpute() works by columns; in that case it is indeed correct to transpose the dataset.

Is there a rule for what to do if a whole row has only missing values? And what should I do if, for example, more than 50% of the values in a row are missing?

If a whole row (or rather, column) has only missing values, knnimpute() will most certainly fail. You can fill such a column with a fixed value, say 0, so that it doesn't affect the dissimilarity measure. If instead a given row (column) has a lot of missing values and you don't want to (or can't) use knnimpute(), you can implement your own imputation technique. A standard one is the mean of the column itself, counting only non-missing values, which is easy to do in Matlab thanks to the nanmean() function. On StackOverflow I posted an answer, which you can find here, covering several missing data imputation techniques. It may help you choose a suitable technique for such a drastic scenario.
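
The row-wise search and the column-mean fallback described above can be sketched in base R, the language used in the other examples on this page. This is only an illustration under stated assumptions, not a port of Matlab's knnimpute(): the function name knn_impute, the choice of a root-mean-square distance over the features two rows share, and the toy matrix are all made up here. A column that is entirely missing would still come out as NaN, so such columns need to be handled beforehand.

```r
# Row-wise k-nearest-neighbour imputation in base R (illustration only).
# `knn_impute` is a hypothetical helper; X must be a numeric matrix.
knn_impute <- function(X, k = 3) {
  col_means <- colMeans(X, na.rm = TRUE)    # fallback, similar to nanmean()
  out <- X
  for (i in which(rowSums(is.na(X)) > 0)) {
    # Distance from row i to every row, over the features both rows observe
    d <- apply(X, 1, function(r) {
      shared <- !is.na(r) & !is.na(X[i, ])
      if (!any(shared)) return(Inf)         # nothing in common: not comparable
      sqrt(mean((r[shared] - X[i, shared])^2))
    })
    d[i] <- Inf                             # a row is not its own neighbour
    for (j in which(is.na(X[i, ]))) {
      # Candidates: comparable rows in which feature j is observed
      cand <- which(!is.na(X[, j]) & is.finite(d))
      if (length(cand) == 0) {
        out[i, j] <- col_means[j]           # e.g. the whole row is missing
      } else {
        nn <- cand[order(d[cand])][seq_len(min(k, length(cand)))]
        out[i, j] <- mean(X[nn, j])         # average over the nearest rows
      }
    }
  }
  out
}

# Toy matrix (made up): row 2 lacks one feature, row 5 is entirely missing
X <- rbind(c(1.0, 2.0, 3.0),
           c(1.1, 2.1, NA),
           c(5.0, 6.0, 7.0),
           c(0.9, 1.9, 2.9),
           c(NA,  NA,  NA))
X_imp <- knn_impute(X, k = 2)
```

With k = 2, the gap in row 2 is filled with the mean of rows 1 and 4, its two nearest neighbours. Row 5 has no features to compare on, so it falls back to the column means.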
