Solved – caret preProcess knnImpute error more nearest neighbours than there are points

caret, data preprocessing, data-imputation, r

I am trying to impute missing data using the preProcess function in caret with the knnImpute method.

library(caret)
library(missForest)
data(iris)
## Introduce a large share of missing values into the iris data set
set.seed(752)
iris.mis <- iris
iris.mis[, c(1,3)] <- prodNA(iris[, c(1,3)], 0.95)
summary(iris.mis)

## k = smallest number of non-missing values in any column, minus 1
myK <- min(unlist(lapply(iris.mis, function(x){150-sum(is.na(x))}))) - 1

preProcValues <- preProcess(iris.mis[, -4], method = c("knnImpute"), k = myK)
t_imp <- predict(preProcValues, iris.mis[, -4])

However, I got the error:

Error in RANN::nn2(old[, non_missing_cols, drop = FALSE], new[, non_missing_cols, :
Cannot find more nearest neighbours than there are points

Is this method not suitable for data with a large share of missing values?

Best Answer

The problem is that knnImpute needs at least as many complete rows (rows without any missing value) as the k you pass for the k-nearest-neighbours search. prodNA distributes the NAs at random, and with noNA = 0.95 it is very likely that too few complete rows remain for your k. Here, none remain at all:

table(apply(iris.mis, 1, function(r) all(!(is.na(r)))))
# FALSE 
#   150

What would work:

# slightly reduce the share of NAs (0.95 -> 0.8)
iris.mis[, c(1,3)] <- prodNA(iris[, c(1,3)], 0.8)
table(apply(iris.mis, 1, function(r) all(!(is.na(r)))))
# FALSE  TRUE 
#   147     3 

# use at most the number of complete rows as k
myK = sum(apply(iris.mis, 1, function(r) all(!is.na(r))))

# impute as done in the question
preProcValues <- preProcess(iris.mis[, -4], method = c("knnImpute"), k = myK)
t_imp <- predict(preProcValues, iris.mis[, -4])

Bottom line: knnImpute does work when many values are missing, but whether you want to use it with so few complete samples depends on your problem and goal.
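The cap on k can be wrapped in a small guard. This is a minimal sketch in base R; the helper name cap_k is hypothetical and not part of caret. It counts complete rows with complete.cases and never returns a k larger than that count. The toy data places NAs deterministically so the count is easy to verify.

```r
# Hypothetical helper (not part of caret): cap k at the number of complete rows.
cap_k <- function(df, k) {
  n_complete <- sum(complete.cases(df))
  if (n_complete == 0) {
    stop("no complete rows: there are no neighbours to search")
  }
  min(k, n_complete)
}

# Deterministic toy data: only the last 10 rows of iris stay complete.
toy <- iris
toy[1:140, "Sepal.Length"] <- NA

cap_k(toy, 25)  # larger than the number of complete rows, so it is capped
cap_k(toy, 5)   # small enough, so it is returned unchanged
```

The returned value could then be passed as k to preProcess in place of a hand-computed myK.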

One more thing: if you were able to use samples for imputation that themselves have NA in certain features, you would effectively be comparing samples in a subspace of your features. In the extreme case of imputing one feature at a time, you would use no other information about a sample with NA values to impute them. That would no longer be classic imputation, which is what knnImpute is designed for.
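To make the "one feature at a time" extreme concrete, here is a rough sketch in base R. The helper name impute_by_column and the choice of the column median are illustrative assumptions. Each column is filled from its own observed values only, so nothing else known about the row influences the result.

```r
# Hypothetical helper: fill each numeric column from its own median.
# No other feature of the row is consulted.
impute_by_column <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy <- data.frame(a = c(1, NA, 3, 10), b = c(NA, 20, 30, 40))
imputed <- impute_by_column(toy)
imputed
```

Every missing entry in a column receives the same value regardless of what the other column says about that row, which is exactly the information a nearest-neighbour approach would otherwise exploit.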

Related Solutions

Solved – Feature selection with Caret for data with more than one target

I don't think caret supports multi-task learning in any of its functions. You could try the glmnet package with family set to "mgaussian". This fits a linear regression model for several responses with lasso, ridge, or elastic net regularization; the lasso and elastic net penalties shrink some coefficients to exactly zero and so perform feature selection, while ridge only shrinks them.

There may be other R machine learning libraries with built-in feature selection that support multi-task learning. Here's some sample code for multi-task learning, using the lasso for variable selection, adapted from ?glmnet:

#Create a dataset
set.seed(42)
library(glmnet)
x <- matrix(rnorm(100*20),100,20)
cf <- sample(0:1, 20, replace=TRUE) #Select some columns
#Apply random coefficients to the selected columns
#(parentheses matter: %*% binds tighter than *)
response1 <- x %*% (cf*runif(20))
response2 <- x %*% (cf*runif(20))
y <- cbind(response1, response2)

#Fit a single lasso model
#alpha = 0 for ridge
#alpha = 1 for lasso
#0 < alpha < 1 for the elastic net (mix of ridge and lasso)
fit1m <- glmnet(x,y,family="mgaussian",alpha=1)
plot(fit1m,type.coef="2norm")

#Select lambda through cross validation
fit1m.cv <- cv.glmnet(x,y,family="mgaussian",alpha=1) 
plot(fit1m.cv)
coef(fit1m.cv) #Show coefficients at the selected value of lambda
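To turn the fitted coefficients into an explicit list of selected features, something like the following sketch may help. It assumes that coef() on an mgaussian fit returns one sparse coefficient matrix per response and that glmnet names unnamed columns V1, V2, and so on. The simulated data, in which only the first five columns carry signal, is made up for illustration.

```r
library(glmnet)

set.seed(42)
x <- matrix(rnorm(100 * 20), 100, 20)
beta <- c(rep(1, 5), rep(0, 15))  # assumption: only the first 5 columns matter
y <- cbind(x %*% beta + rnorm(100, sd = 0.1),
           x %*% (2 * beta) + rnorm(100, sd = 0.1))

cv_fit <- cv.glmnet(x, y, family = "mgaussian", alpha = 1)

# One coefficient matrix per response; keep the names of non-zero rows.
coefs <- coef(cv_fit, s = "lambda.1se")
selected <- lapply(coefs, function(m) {
  nonzero <- rownames(m)[m[, 1] != 0]
  setdiff(nonzero, "(Intercept)")
})
selected
```

Because the mgaussian penalty groups each predictor's coefficients across responses, a predictor is normally kept or dropped for all responses together.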

Solved – Data imputation with preProcess in caret returns less observations than expected

preProcess does not return imputed values; it only sets up the imputation model from the data you provide. You therefore need to run predict (which also requires the RANN package), but even then your artificial data will produce an error:

Error in FUN(newX[, i], ...) : cannot impute when all predictors are missing in the new data point

because imputation cannot work in rows where both of your predictors are NA.
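Such rows can be found before calling predict. A minimal base R sketch, using a made-up four-row data frame rather than the data from the question:

```r
# Toy data: row 4 has every predictor missing.
toy <- data.frame(V1 = c(1, NA, 3, NA), V2 = c(NA, 2, 3, NA))

# A row cannot be imputed from neighbours when all of its predictors are NA.
all_missing <- rowSums(is.na(toy)) == ncol(toy)
which(all_missing)

# Keep only rows that still carry at least one observed value.
imputable <- toy[!all_missing, , drop = FALSE]
imputable
```

Whether to drop such rows or handle them some other way depends on the analysis; the sketch only shows how to detect them.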

Here's a demonstration with only 20 rows, for clarity and easy inspection:

library(caret)

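# two columns, each holding 1..20
t <- data.frame(seq_len(20), seq_len(20))

# every 3rd value of column 1 and every 7th value of column 2 becomes NA
for (i in 1:20) {
  if (i %% 3 == 0) t[i, 1] <- NA
  if (i %% 7 == 0) t[i, 2] <- NA
}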

names(t) <- c('V1', 'V2')

preProcValues <- preProcess(t, method = c("knnImpute"))

library(RANN)

t_imp <- predict(preProcValues, t)

When viewing the result, keep in mind that the methods "center" and "scale" have been added to your preprocessing automatically, even though you did not request them explicitly:

> str(preProcValues)
List of 19
$ call      : language preProcess.default(x = t, method = c("knnImpute"))
$ dim       : int [1:2] 12 2
$ bc        : NULL
$ yj        : NULL
$ et        : NULL
$ mean      : Named num [1:2] 10.5 10.5
 ..- attr(*, "names")= chr [1:2] "V1" "V2"
$ std       : Named num [1:2] 6.25 6.14
 ..- attr(*, "names")= chr [1:2] "V1" "V2"
$ ranges    : NULL
$ rotation  : NULL
$ method    : chr [1:3] "knnImpute" "scale" "center"
$ thresh    : num 0.95
$ pcaComp   : NULL
$ numComp   : NULL
$ ica       : NULL
$ k         : num 5
$ knnSummary:function (x, ...)  
$ bagImp    : NULL
$ median    : NULL
$ data      : num [1:12, 1:2] -1.434 -1.283 -0.981 -0.83 -0.377 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : chr [1:12] "1" "2" "4" "5" ...
 .. ..$ : chr [1:2] "V1" "V2"
 ..- attr(*, "scaled:center")= Named num [1:2] 10.5 10.5
 .. ..- attr(*, "names")= chr [1:2] "V1" "V2"
 ..- attr(*, "scaled:scale")= Named num [1:2] 6.63 6.63
 .. ..- attr(*, "names")= chr [1:2] "V1" "V2"
- attr(*, "class")= chr "preProcess"
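
Because the imputed values come back centered and scaled, you may want them on the original scale again. The following base R sketch uses a hypothetical helper, unscale, that reverses the transformation from a named vector of means and a named vector of standard deviations. It is demonstrated on made-up data and does not depend on caret.

```r
# Hypothetical helper: undo centering and scaling, column by column.
# 'center' and 'scale' are named numeric vectors, one entry per column.
unscale <- function(df, center, scale) {
  for (col in names(center)) {
    df[[col]] <- df[[col]] * scale[[col]] + center[[col]]
  }
  df
}

raw <- data.frame(V1 = c(1, 2, 4, 5), V2 = c(2, 4, 8, 10))
m <- colMeans(raw)
s <- apply(raw, 2, sd)
scaled <- as.data.frame(scale(raw, center = m, scale = s))

restored <- unscale(scaled, m, s)
all.equal(restored$V1, raw$V1)
```

With the objects from the demonstration above, a call such as unscale(t_imp, preProcValues$mean, preProcValues$std) might do the same, assuming predict applied exactly the mean and std stored in the preProcess object.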
