Solved – K-nearest neighbour imputation of missing values

data-imputation, k nearest neighbour, machine learning, MATLAB, missing data

I have a dataset where the columns correspond to features and the rows correspond to data points. I have around 5,000 data points and 8 features. I would like to impute the missing values with the nearest-neighbour method, and for this I'm using the MATLAB function knnimpute.

Let's say feature 4 of row 10 has a missing value. Should I search for the nearest data points (rows) or the nearest columns? I tend towards searching for the nearest data points, because I want the feature value of the closest data point. I think in this case I have to call knnimpute(data'), i.e. on the transposed data.

Of course there is the possibility that a whole row has only missing values (or more than 50% missing values). I think Matlab does no imputation if a whole row has only missing values.

Is there a rule what to do if a whole row has only missing values? And what should I do if there are e.g. more than 50% missing values in a row?

Best Answer

Should I search the nearest data points (rows) or the nearest columns?
Search by rows. You should look for the nearest data points (i.e. rows) and impute the missing value in feature j using the jth feature of the nearest neighbours. I don't know why MATLAB's knnimpute() works by columns, but given that it does, it is indeed correct to transpose the dataset (and to transpose the result back afterwards).
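To make the row-wise idea concrete, here is a minimal sketch in plain Python. It is not MATLAB's knnimpute() and does not claim to reproduce its behaviour. The function name knn_impute_rows, the use of None for missing values, the unweighted mean over the k neighbours, and the distance computed only over features observed in both rows are all assumptions chosen for illustration.

```python
import math

def knn_impute_rows(data, k=3):
    """Impute None entries row-wise: rows are data points, columns are features."""
    n_features = len(data[0])
    imputed = [row[:] for row in data]  # copy, so distances always use the original values
    for i, row in enumerate(data):
        for j in range(n_features):
            if row[j] is not None:
                continue
            candidates = []
            for m, other in enumerate(data):
                if m == i or other[j] is None:
                    continue  # a neighbour must have feature j observed
                # Compare the two rows only on features observed in both of them.
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue  # nothing to compare on, so no distance can be computed
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            if candidates:  # entries without any usable neighbour stay missing
                candidates.sort(key=lambda c: c[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],    # feature 3 is missing for this data point
    [5.0, 6.0, 7.0],
    [0.9, 1.9, 2.8],
    [None, None, None],  # a fully missing row cannot be matched to any neighbour
]
result = knn_impute_rows(data, k=2)
```

In this toy dataset the second row is closest to the first and fourth rows, so its missing third feature becomes the mean of their third features. The last row has no observed feature to compute a distance on, so the sketch leaves it untouched, which is the situation the second part of the question asks about.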
Is there a rule what to do if a whole row has only missing values? And what should I do if there are e.g. more than 50% missing values in a row?
Well, if a whole row (or better, column) has only missing values, knnimpute() will most certainly fail. You can fill such a column with a given value, let's say 0, so that it doesn't affect the dissimilarity measure. If instead a given row (column) has a lot of missing values and you don't want to (or can't) use knnimpute(), you can implement your own imputation technique. A standard technique is the mean of the column itself, counting only the non-missing values, which is easy to do in MATLAB thanks to the nanmean() function. On StackOverflow I posted an answer, which you can find here, regarding several missing data imputation techniques. Maybe you can read it and choose a suitable technique for such a drastic scenario.
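The mean-based fallback described above can be sketched as follows, again in plain Python rather than MATLAB. The name column_mean_impute, the fill_if_empty parameter, and the use of None for missing values are hypothetical choices for this illustration. Columns are features here, and a column with no observed value at all receives the constant, mirroring the "fill with 0" suggestion.

```python
def column_mean_impute(data, fill_if_empty=0.0):
    """Replace None with the column mean computed over non-missing values only."""
    n_features = len(data[0])
    imputed = [row[:] for row in data]  # leave the input untouched
    for j in range(n_features):
        observed = [row[j] for row in data if row[j] is not None]
        # A column with no observed value has no mean, so fall back to a constant.
        fill = sum(observed) / len(observed) if observed else fill_if_empty
        for row in imputed:
            if row[j] is None:
                row[j] = fill
    return imputed

data = [
    [1.0, None, None],
    [3.0, 4.0, None],
    [None, 8.0, None],
]
result = column_mean_impute(data)
# result == [[1.0, 6.0, 0.0], [3.0, 4.0, 0.0], [2.0, 8.0, 0.0]]
```

Mean imputation ignores the relationships between features, so it is best seen as a simple baseline for cases where too little is observed for a neighbour search.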

Related Solutions

Solved – caret preProcess knnImpute error more nearest neighbours than there are points

The problem you run into is that knnImpute requires at least as many samples without missing values in your data as the k you specify for the k-nearest-neighbours search. Because prodNA distributes NA values randomly, a noNA of 0.95 is so high that you will very likely end up with too few samples without NA values for your k:

table(apply(iris.mis, 1, function(r) all(!(is.na(r)))))
# FALSE 
#   150

What would work:

# slightly reduce amount of NA
iris.mis[, c(1,3)] <- prodNA(iris[, c(1,3)], 0.8)
table(apply(iris.mis, 1, function(r) all(!(is.na(r)))))
# FALSE  TRUE 
#   147     3 

# use at most the number of samples without NA as k
myK = sum(apply(iris.mis, 1, function(r) all(!is.na(r))))

# impute as done in the question
preProcValues <- preProcess(iris.mis[, -4], method = c("knnImpute"), k = myK)
t_imp <- predict(preProcValues, iris.mis[, -4])

So, bottom line: knnImpute does work with many values missing, but whether you want to use it with so few complete samples depends on your problem and goal.

One more thing: keep in mind that if you were able to use samples for imputation that themselves have certain features set to NA, this would boil down to looking at samples in a subspace of your features. For example, in the extreme case of looking at and imputing only 1 feature at a time, you would not use any other information about a sample that has NA values to impute them. This would therefore no longer be classic imputation, which is what e.g. knnImpute is designed for.

Solved – Missing values for multiple columns

A common approach is Multivariate Imputation by Chained Equations (MICE). A paper about the topic can be found here.

Several statistical software packages are able to perform MICE. Below you can find an example in R, in which I used the package mice to impute some example data.

# Example data
N <- 1000
x1 <- rnorm(N)
x2 <- x1 + rnorm(N)
x3 <- rnorm(N)
x4 <- x2 + x3 + rnorm(N)
x5 <- rnorm(N)

# Insert missings
x1[rbinom(N, 1, 0.1) == 1] <- NA
x2[rbinom(N, 1, 0.2) == 1] <- NA
x3[rbinom(N, 1, 0.05) == 1] <- NA
x4[rbinom(N, 1, 0.1) == 1] <- NA
x5[rbinom(N, 1, 0.3) == 1] <- NA

# Data with missings
data <- data.frame(x1, x2, x3, x4, x5)

# Imputation
library("mice")
imp <- mice(data, m = 1)
# m = 1 specifies a single imputation; the default is m = 5 for multiple imputation
# The imputation method can be specified with 'method = '; the default for numeric data is pmm
# The predictor matrix can be specified with 'predictorMatrix'

# Completed data
data_imp <- complete(imp)
