Solved – caret preProcess knnImpute error more nearest neighbours than there are points

caret, data preprocessing, data-imputation, r

I am trying to impute missing data using the preProcess function in caret with the knnImpute method.

library(missForest)
data(iris)
## Introduce large missing values to the iris data set
set.seed(752)
iris.mis = iris
iris.mis[, c(1,3)] <- prodNA(iris[, c(1,3)], 0.95)
summary(iris.mis)

myK = min(unlist(lapply(iris.mis, function(x){150-sum(is.na(x))}))) - 1

preProcValues <- preProcess(iris.mis[, -4], method = c("knnImpute"), k = myK)
t_imp <- predict(preProcValues, iris.mis[, -4])

However, I got the error:

Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, :
Cannot find more nearest neighbours than there are points

Is this method not suitable for data with a large proportion of missing values?

Best Answer

The problem is that knnImpute needs at least k samples without any missing values, where k is the number of nearest neighbours you request. prodNA distributes the NAs at random, and with noNA = 0.95 being this high, it is very likely that too few complete samples remain for your k. In your data, none remain:

table(apply(iris.mis, 1, function(r) all(!(is.na(r)))))
# FALSE 
#   150 
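As a rough check on why complete rows are so rare here, the expected count can be estimated by hand. This is only a sketch: it assumes each cell is missing independently with probability noNA, which approximates how prodNA samples cells, and the function name expected_complete_rows is made up for illustration.

```r
# Expected number of complete rows when each of `n_cols_with_na` cells in a
# row is missing independently with probability `p_missing` (approximation).
expected_complete_rows <- function(n_rows, n_cols_with_na, p_missing) {
  n_rows * (1 - p_missing)^n_cols_with_na
}

# iris has 150 rows; NAs were introduced in 2 columns.
expected_complete_rows(150, 2, 0.95)  # well below 1 complete row on average
expected_complete_rows(150, 2, 0.80)  # a handful of complete rows on average
```

Under this approximation, 95% missingness in two columns leaves less than one complete row on average, so any k of 1 or more is likely to fail.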

What would work:

# reduce the proportion of NAs
iris.mis[, c(1,3)] <- prodNA(iris[, c(1,3)], 0.8)
table(apply(iris.mis, 1, function(r) all(!(is.na(r)))))
# FALSE  TRUE 
#   147     3 

# use at most the number of samples without NA as k
myK = sum(apply(iris.mis, 1, function(r) all(!is.na(r))))

# impute as done in the question
preProcValues <- preProcess(iris.mis[, -4], method = c("knnImpute"), k = myK)
t_imp <- predict(preProcValues, iris.mis[, -4])
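If you would rather not pick k by hand, the same check can be wrapped in a small guard. The helper below is hypothetical (safe_k is not part of caret) and uses base R only. It assumes you pass it the same columns you give to preProcess. The default of 5 matches preProcess's default k.

```r
# Hypothetical helper: cap the requested k at the number of complete rows,
# and fail early with a clear message if there are none.
safe_k <- function(df, k_desired = 5) {
  n_complete <- sum(complete.cases(df))
  if (n_complete == 0) {
    stop("No complete rows: knnImpute has no neighbours to search.")
  }
  min(k_desired, n_complete)
}

# Toy data: rows 3 and 4 are complete, rows 1 and 2 are not.
toy <- data.frame(a = c(1, NA, 3, 4), b = c(NA, 2, 3, 4))
safe_k(toy, k_desired = 5)  # capped at 2, the number of complete rows
```

The result could then be passed as k to preProcess in place of myK, for example k = safe_k(iris.mis[, -4]).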

So, the bottom line: knnImpute does work when many values are missing, but whether you want to use it with so few complete samples depends on your problem and goal.

One more thing: keep in mind that if you were able to use samples for imputation that themselves have NA in certain features, you would effectively be comparing samples in a subspace of your features. In the extreme case of looking at and imputing only one feature at a time, you would use no other information about a sample with NA values to impute them. That would no longer be classic imputation, which is what knnImpute, for example, is designed for.
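To make that extreme case concrete, here is a minimal sketch of imputing one feature at a time with the column median, using base R only. The function name impute_median_per_column is made up for illustration. Note that each filled-in value depends only on its own column, never on the other features of the same sample.

```r
# Per-feature imputation: every NA in a numeric column is replaced by that
# column's median, ignoring all other columns of the same row.
impute_median_per_column <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy <- data.frame(a = c(1, NA, 3, 10), b = c(100, 200, NA, 400))
impute_median_per_column(toy)
# Row 2 gets a = 3 and row 3 gets b = 200, regardless of their other values.
```

caret offers this kind of per-column approach as method = "medianImpute" in preProcess, which does not need any complete rows.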