Solved – Data imputation with preProcess in caret returns fewer observations than expected

caret, data-imputation, r

Why does the preProcess function from R's caret package, when used to impute a dataset's missing values, return fewer observations than the original dataset contains?
For example:

library(caret)

t <- data.frame(seq_len(100000),seq_len(100000))

for (i in 1:100000)
{
  if (i %% 10 == 0) t[i,1] <- NA
  if (i %% 100 == 0) t[i,2] <- NA
}

preProcValues <- preProcess(t, method = c("knnImpute"))

preProcValues contains only 90000 observations of 2 variables, while 100000 are expected.

Best Answer

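preProcess does not return imputed values; it only sets up the imputation model from the data you provide. The 90000 you see is the number of complete rows (rows without any NA), which the object keeps as the reference set for the nearest-neighbour search. To get the imputed data you need to run predict (which also requires the RANN package), but even if you do so with your artificial data you'll get an error: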

Error in FUN(newX[, i], ...) : cannot impute when all predictors are missing in the new data point

because imputation cannot work for rows in which both of your predictors are NA (every 100th row in your data).
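Both numbers can be checked on the original 100000-row data with base R alone. The sketch below rebuilds that data without the loop, then counts the rows with no NA and the rows in which every predictor is NA. The names V1, V2, n, all_na and t_ok are my own, and dropping the all-NA rows is only one possible workaround, not something caret does for you.

```r
# Rebuild the question's data without the loop
n <- 100000
t <- data.frame(V1 = seq_len(n), V2 = seq_len(n))
t[seq(10, n, by = 10), "V1"] <- NA    # every 10th value of V1
t[seq(100, n, by = 100), "V2"] <- NA  # every 100th value of V2

# Rows with no NA at all
n_complete <- sum(complete.cases(t))

# Rows in which every predictor is NA; these cannot be imputed
all_na <- rowSums(is.na(t)) == ncol(t)

n_complete   # 90000
sum(all_na)  # 1000

# One possible workaround (an assumption, not part of caret):
# drop the all-NA rows before calling predict()
t_ok <- t[!all_na, ]
nrow(t_ok)   # 99000
```

Whether dropping those rows is acceptable depends on your data; the alternative is to add predictors so that no row is entirely missing.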

Here's a demonstration with only 20 rows, for clarity and easy inspection:

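library(caret)
library(RANN)  # predict() needs RANN for the nearest-neighbour search

t <- data.frame(seq_len(20), seq_len(20))

# Put NA in every 3rd row of column 1 and every 7th row of column 2
for (i in 1:20)
{
  if (i %% 3 == 0) t[i,1] <- NA
  if (i %% 7 == 0) t[i,2] <- NA
}

names(t) <- c('V1', 'V2')

# Set up the imputation model; no values are imputed yet
preProcValues <- preProcess(t, method = c("knnImpute"))

# Apply the model to obtain the imputed data
t_imp <- predict(preProcValues, t)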

When viewing the result, keep in mind that the methods "center" and "scale" have been automatically added to your preprocessing, even though you did not request them explicitly, so the values in t_imp are centered and scaled rather than on the original scale:

> str(preProcValues)
List of 19
$ call      : language preProcess.default(x = t, method = c("knnImpute"))
$ dim       : int [1:2] 12 2
$ bc        : NULL
$ yj        : NULL
$ et        : NULL
$ mean      : Named num [1:2] 10.5 10.5
 ..- attr(*, "names")= chr [1:2] "V1" "V2"
$ std       : Named num [1:2] 6.25 6.14
 ..- attr(*, "names")= chr [1:2] "V1" "V2"
$ ranges    : NULL
$ rotation  : NULL
$ method    : chr [1:3] "knnImpute" "scale" "center"
$ thresh    : num 0.95
$ pcaComp   : NULL
$ numComp   : NULL
$ ica       : NULL
$ k         : num 5
$ knnSummary:function (x, ...)  
$ bagImp    : NULL
$ median    : NULL
$ data      : num [1:12, 1:2] -1.434 -1.283 -0.981 -0.83 -0.377 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : chr [1:12] "1" "2" "4" "5" ...
 .. ..$ : chr [1:2] "V1" "V2"
 ..- attr(*, "scaled:center")= Named num [1:2] 10.5 10.5
 .. ..- attr(*, "names")= chr [1:2] "V1" "V2"
 ..- attr(*, "scaled:scale")= Named num [1:2] 6.63 6.63
 .. ..- attr(*, "names")= chr [1:2] "V1" "V2"
- attr(*, "class")= chr "preProcess"
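If you want the imputed values back on the original scale, one option is to undo the centering and scaling by hand, using the mean and std fields that appear in the str output. This is only a sketch under that assumption: t_orig is a name I made up, and I am not aware of a built-in caret function for this step.

```r
library(caret)
library(RANN)  # needed by predict() for knnImpute

# Same 20-row example, built without the loop
t <- data.frame(V1 = seq_len(20), V2 = seq_len(20))
t$V1[seq(3, 20, by = 3)] <- NA
t$V2[seq(7, 20, by = 7)] <- NA

preProcValues <- preProcess(t, method = c("knnImpute"))
t_imp <- predict(preProcValues, t)  # imputed, but centered and scaled

# Undo the scaling and centering column by column:
# original = scaled * std + mean
t_orig <- t_imp
for (v in names(t_orig)) {
  t_orig[[v]] <- t_imp[[v]] * preProcValues$std[[v]] + preProcValues$mean[[v]]
}

head(t_orig)
```

The values that were not missing should come back unchanged (up to floating-point error), and the former NAs are replaced by the kNN estimates expressed in the original units.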
