Solved – KNN imputation R packages

data-imputationk nearest neighbourr

I am looking for a KNN imputation package. I have been looking at imputation package (http://cran.r-project.org/web/packages/imputation/imputation.pdf) but for some reason the KNN impute function (even when following the example from the description) only seems to impute zero values (as per below). I have been looking around but cannot find something yet, and hence was wondering if anyone has other suggestions for good KNN imputation packages?

W

In the code per below – the NA values are replaced by zero's – not by the Knn mean value

require(imputation)
x = matrix(rnorm(100),10,10)
x.missing = x > 1
x[x.missing] = NA
kNNImpute(x, 3)
x

Best Answer

You could also try the following package: DMwR.

It failed on the case of 3 NN, giving 'Error in knnImputation(x, k = 3) : Not sufficient complete cases for computing neighbors.'

However, trying 2 gives.

> knnImputation(x,k=2)
             [,1]       [,2]       [,3]       [,4]       [,5]        [,6]
 [1,] -0.59091360 -1.2698175  0.5556009 -0.1327224 -0.8325065  0.71664000
 [2,] -1.27255074 -0.7853602  0.7261897  0.2969900  0.2969556 -0.44612831
 [3,]  0.55473981  0.4748735  0.5158498 -0.9493917 -1.5187722 -0.99377854
 [4,] -0.47797654  0.1647818  0.6167311 -0.5149731  0.5240514 -0.46027809
 [5,] -1.08767831 -0.3785608  0.6659499 -0.7223724 -0.9512409 -1.60547053
 [6,] -0.06153279  0.9486815 -0.5464601  0.1544475  0.2835521 -0.82250221
 [7,] -0.82536029 -0.2906253 -3.0284281 -0.8473210  0.7985286 -0.09751927
 [8,] -1.15366189  0.5341000 -1.0109258 -1.5900281  0.2742328  0.29039928
 [9,] -1.49504465 -0.5419533  0.5766574 -1.2412777 -1.4089572 -0.71069839
[10,] -0.35935440 -0.2622265  0.4048126 -2.0869817  0.2682486  0.16904559
             [,7]       [,8]        [,9]      [,10]
 [1,]  0.58027159 -1.0669137  0.48670802  0.5824858
 [2,] -0.48314440 -1.0532693 -0.34030385 -1.1041681
 [3,] -2.81996446  0.3191438 -0.48117020 -0.0352633
 [4,] -0.55080515 -1.0620243 -0.51383557  0.3161907
 [5,] -0.56808769 -0.3696951  0.35549191  0.3202675
 [6,] -0.25043479 -1.0389393  0.07810902  0.5251606
 [7,] -0.41667318  0.8809541 -0.04613332 -1.1586756
 [8,] -0.06898363 -1.0736161  0.62698065 -1.0373835
 [9,]  0.30051583 -0.2936140  0.31417921 -1.4155193
[10,] -0.68180034 -1.0789745  0.58290920 -1.0197956

You can test for sufficient observations using complete.cases(x), where that value must be at least k.

One way to overcome this problem is to relax your requirements (i.e. less incomplete rows), by 1) increasing the NA threshold, or alternatively, 2) increasing your number of observations.

Here is the first:

> x = matrix(rnorm(100),10,10)
> x.missing = x > 2
> x[x.missing] = NA
> complete.cases(x)
 [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
> knnImputation(x,k=3)
             [,1]       [,2]       [,3]       [,4]        [,5]       [,6]       [,7]        [,8]        [,9]       [,10]
 [1,]  0.86882569 -0.2409922  0.3859031  0.5818927 -1.50310330  0.8752261 -0.5173105 -2.18244988 -0.28817656 -0.63941237
 [2,]  1.54114079  0.7227511  0.7856277  0.8512048 -1.32442954 -2.1668744  0.7017532 -0.40086348 -0.41251883  0.42924986
 [3,]  0.60062917 -0.5955623  0.6192783 -0.3836310  0.06871570  1.7804657  0.5965411 -1.62625036  1.27706937  0.72860273
 [4,] -0.07328279 -0.1738157  1.4965579 -1.1686115 -0.06954318 -1.0171604 -0.3283916  0.63493884  0.72039689 -0.20889111
 [5,]  0.78747874 -0.8607320  0.4828322  0.6558960 -0.22064430  0.2001473  0.7725701  0.06155196  0.09011719 -1.01902968
 [6,]  0.17988720 -0.8520000 -0.5911523  1.8100573 -0.56108621  0.0151522 -0.2484345 -0.80695513 -0.18532984 -1.75115335
 [7,]  1.03943492  0.4880532 -2.7588922 -0.1336166 -1.28424057  1.2871333  0.7595750 -0.55615677 -1.67765572 -0.05440992
 [8,]  1.12394474  1.4890366 -1.6034648 -1.4315445 -0.23052386 -0.3536677 -0.8694188 -0.53689507 -1.11510406 -1.39108817
 [9,] -0.30393916  0.6216156  0.1559639  1.2297105 -0.29439390  1.8224512 -0.4457441 -0.32814665  0.55487894 -0.22602598
[10,]  1.18424722 -0.1816049 -2.2975095 -0.7537477  0.86647524 -0.8710603  0.3351710 -0.79632184 -0.56254688 -0.77449398
> x
             [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]        [,8]        [,9]       [,10]
 [1,]  0.86882569 -0.2409922  0.3859031  0.5818927 -1.5031033  0.8752261 -0.5173105 -2.18244988 -0.28817656 -0.63941237
 [2,]  1.54114079  0.7227511  0.7856277  0.8512048 -1.3244295 -2.1668744  0.7017532 -0.40086348 -0.41251883  0.42924986
 [3,]  0.60062917 -0.5955623  0.6192783 -0.3836310  0.0687157  1.7804657  0.5965411 -1.62625036  1.27706937  0.72860273
 [4,] -0.07328279 -0.1738157  1.4965579 -1.1686115         NA -1.0171604 -0.3283916  0.63493884  0.72039689 -0.20889111
 [5,]  0.78747874 -0.8607320  0.4828322         NA -0.2206443  0.2001473  0.7725701  0.06155196  0.09011719 -1.01902968
 [6,]  0.17988720 -0.8520000 -0.5911523  1.8100573 -0.5610862  0.0151522 -0.2484345 -0.80695513 -0.18532984 -1.75115335
 [7,]  1.03943492  0.4880532 -2.7588922 -0.1336166 -1.2842406  1.2871333  0.7595750 -0.55615677 -1.67765572 -0.05440992
 [8,]  1.12394474  1.4890366 -1.6034648 -1.4315445 -0.2305239 -0.3536677 -0.8694188 -0.53689507 -1.11510406 -1.39108817
 [9,] -0.30393916  0.6216156  0.1559639  1.2297105 -0.2943939  1.8224512 -0.4457441 -0.32814665  0.55487894 -0.22602598
[10,]  1.18424722 -0.1816049 -2.2975095 -0.7537477  0.8664752 -0.8710603  0.3351710 -0.79632184 -0.56254688 -0.77449398

here is an example of the 2nd...

x = matrix(rnorm(1000),100,10)
x.missing = x > 1
x[x.missing] = NA

complete.cases(x)

  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
 [22] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [43]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE

At least k=3 complete rows are satisfied, thus it is able to impute for k=3.

> head(knnImputation(x,k=3))
            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]       [,8]        [,9]       [,10]
[1,]  0.01817557 -2.8141502  0.3929944  0.1495092 -1.7218396  0.4159133 -0.8438809  0.6599224 -0.02451113 -1.14541016
[2,]  0.51969964 -0.4976021 -0.1495392 -0.6448184 -0.6066386 -1.6210476 -0.3118440  0.2477855 -0.30986749  0.32424673
...
Related Question