I am trying to impute missing data using the preProcess function in caret with the knnImpute method.
library(caret)       # provides preProcess()
library(missForest)  # provides prodNA()
data(iris)
## Introduce a large proportion of missing values into the iris data set
set.seed(752)
iris.mis = iris
iris.mis[, c(1,3)] <- prodNA(iris[, c(1,3)], 0.95)
summary(iris.mis)
myK = min(unlist(lapply(iris.mis, function(x){150-sum(is.na(x))}))) - 1
preProcValues <- preProcess(iris.mis[, -4], method = c("knnImpute"), k = myK)
t_imp <- predict(preProcValues, iris.mis[, -4])
However, I got the error:
Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, :
Cannot find more nearest neighbours than there are points
Is this method not suitable for data with a large proportion of missing values?
Best Answer
The problem you run into is that knnImpute requires at least as many samples without missing values in your data as you have specified with the k parameter for the k-nearest-neighbours. As you use prodNA, you distribute NA values randomly, and with noNA = 0.95 being pretty high, you are very likely left with too few samples without NA values for your k.
What would work:
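One way to check this, and one option for picking a k the data can support, is sketched below. It is only an illustration under stated assumptions: blank_cells is a hypothetical base-R stand-in for prodNA (so the sketch runs without missForest), and k_safe is a name introduced here, not something caret provides.

```r
# Hypothetical stand-in for missForest::prodNA (an assumption, used only to
# keep this sketch dependency-free): blank out a fraction `frac` of all cells.
blank_cells <- function(df, frac) {
  m <- as.matrix(df)
  n_blank <- round(frac * length(m))
  m[sample(length(m), n_blank)] <- NA
  as.data.frame(m)
}

set.seed(752)
iris.mis <- iris
iris.mis[, c(1, 3)] <- blank_cells(iris[, c(1, 3)], 0.95)

# Rows with no NA in the columns handed to preProcess
n_complete <- sum(complete.cases(iris.mis[, -4]))

# k as computed in the question: fewest observed values in any column, minus 1
myK <- min(sapply(iris.mis, function(x) sum(!is.na(x)))) - 1

# Largest k that does not exceed the number of complete rows
k_safe <- min(myK, n_complete)

cat("complete rows:", n_complete,
    " requested k:", myK,
    " usable k:", k_safe, "\n")
```

If k_safe is at least 1, it could be passed as k to preProcess in place of myK, which should avoid the nearest-neighbour error. Lowering the noNA fraction is the other obvious lever, since it leaves more complete rows to choose neighbours from.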
So, bottom line, knnImpute does work with many values missing, but whether you want to use it with so few samples will depend on your problem and goal.
One more thing: keep in mind that if you were able to use samples for imputation that have certain features set to NA themselves, this boils down to looking at samples in a subspace of your features. For example, in the extreme case of looking at and imputing one feature at a time only, you would not use any other information about a sample that has NA values to impute them. This would therefore no longer be classic imputation, which is what e.g. knnImpute is designed for.
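To make that extreme case concrete, here is a minimal sketch in base R. The toy data frame and the impute_by_column helper are hypothetical, and the column mean is an assumed choice of summary; any single-column summary would make the same point.

```r
# Toy data (hypothetical values) with NAs in both columns
toy <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 50))

# Impute each column from its own observed values only,
# ignoring everything else known about the row
impute_by_column <- function(df) {
  df[] <- lapply(df, function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  df
}

toy_imp <- impute_by_column(toy)
print(toy_imp)
```

Every missing entry in a column receives the same value, whatever the other column says about that row, which is exactly the information a neighbour-based method is meant to exploit.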