By default, machine learning packages turn inverse distance weighting off for KNN. To me, it seems that inverse distance weighting is always a good option.
Why would we not want to use IDW with KNN? [And why would we want to?]
k-nearest-neighbour, machine-learning
Is there an advantage to using higher dimensions (2D, 3D, etc.), or should you just build x-1 single-dimension classifiers and aggregate their predictions in some way?
This depends on whether your features are informative or not. Do you suspect that some features will not be useful in your classification task? To gain a better idea of your data, you can also try to compute pairwise correlation or mutual information between the response variable and each of your features.
To combine all (or a subset) of your features, you can try computing the L1 (Manhattan) or L2 (Euclidean) distance between the query point and each 'training' point as a starting point.
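For instance, a minimal R sketch (the 'train' and 'query' objects here are made up for illustration):

train <- matrix(rnorm(50), nrow = 10, ncol = 5)   # 10 training points, 5 features
query <- rnorm(5)                                  # a single query point

d.L1 <- apply(train, 1, function(p) sum(abs(p - query)))      # Manhattan distance
d.L2 <- apply(train, 1, function(p) sqrt(sum((p - query)^2))) # Euclidean distance

k <- 3
order(d.L1)[1:k]   # indices of the k nearest training points under L1
order(d.L2)[1:k]   # indices of the k nearest training points under L2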
Building all of these classifiers from all potential combinations of the variables would be computationally expensive. How could I optimize this search to find the best kNN classifiers from that set?
This is the problem of feature subset selection. There is a lot of academic work in this area (see Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182. for a good overview).
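As a rough sketch of one simple approach (greedy forward selection, not a method from the cited survey), you could add one feature at a time as long as leave-one-out kNN accuracy improves; the built-in iris data stands in for your own here:

library(class)   # provides knn.cv for leave-one-out kNN

X <- iris[, 1:4]        # candidate features
y <- iris$Species       # response
k <- 5

selected <- integer(0)
remaining <- seq_len(ncol(X))
best.acc <- 0

repeat {
  # leave-one-out accuracy when each remaining feature is added to the current subset
  accs <- sapply(remaining, function(j) {
    pred <- knn.cv(X[, c(selected, j), drop = FALSE], y, k = k)
    mean(pred == y)
  })
  if (max(accs) <= best.acc) break   # stop when no feature improves accuracy
  best.acc <- max(accs)
  best.j <- remaining[which.max(accs)]
  selected <- c(selected, best.j)
  remaining <- setdiff(remaining, best.j)
}

names(X)[selected]   # the greedily selected feature subset
best.acc             # its leave-one-out accuracy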
And, once I find a series of classifiers what's the best way to combine their output to a single prediction?
This depends on whether or not the selected features are independent. If they are independent, you can weight each feature by its mutual information (or some other measure of informativeness) with the response variable (whatever you are classifying on). If some features are dependent, then a single classification model over all of them will probably work best.
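A minimal sketch of the independent-features case: one kNN classifier per feature, with each classifier's vote weighted by that feature's informativeness (absolute correlation with the response is used here as a simple stand-in for mutual information; the data and names are illustrative):

library(class)

X <- iris[, 1:4]
y <- iris$Species
k <- 5

# informativeness of each feature: |correlation| with the (numerically coded) class
w <- abs(sapply(X, function(f) cor(f, as.numeric(y))))

# leave-one-out predictions from each single-feature kNN classifier
preds <- sapply(seq_along(X), function(j)
  as.character(knn.cv(X[, j, drop = FALSE], y, k = k)))

# combine: for each sample, sum the weights of the classifiers voting for each class
combined <- apply(preds, 1, function(p)
  names(which.max(tapply(w, p, sum))))

mean(combined == y)   # accuracy of the weighted combination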
How do most implementations apply kNN to more generalized learning?
By allowing the user to specify their own distance metric (or a precomputed distance matrix) over the set of points. kNN works well when an appropriate distance metric is used.
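For example, base R's class::knn is limited to Euclidean distance, but a small wrapper around a user-supplied distance matrix (from dist() with any method, or from any custom metric you define) might look like this; the function name and data are illustrative:

knn.from.dist <- function(D, cl, k = 3) {
  D <- as.matrix(D)
  diag(D) <- Inf                      # a point is not its own neighbour
  apply(D, 1, function(d) {
    nn <- order(d)[1:k]               # indices of the k nearest points
    names(which.max(table(cl[nn])))   # majority vote among the neighbours
  })
}

X <- iris[, 1:4]
y <- iris$Species
D <- dist(X, method = "manhattan")    # any metric supported by dist(), or your own
pred <- knn.from.dist(D, y, k = 5)
mean(pred == y)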
You could also try the following package: DMwR.
It failed in the case of 3 NN, giving 'Error in knnImputation(x, k = 3) : Not sufficient complete cases for computing neighbors.'
However, trying k = 2 gives:
> knnImputation(x,k=2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.59091360 -1.2698175 0.5556009 -0.1327224 -0.8325065 0.71664000
[2,] -1.27255074 -0.7853602 0.7261897 0.2969900 0.2969556 -0.44612831
[3,] 0.55473981 0.4748735 0.5158498 -0.9493917 -1.5187722 -0.99377854
[4,] -0.47797654 0.1647818 0.6167311 -0.5149731 0.5240514 -0.46027809
[5,] -1.08767831 -0.3785608 0.6659499 -0.7223724 -0.9512409 -1.60547053
[6,] -0.06153279 0.9486815 -0.5464601 0.1544475 0.2835521 -0.82250221
[7,] -0.82536029 -0.2906253 -3.0284281 -0.8473210 0.7985286 -0.09751927
[8,] -1.15366189 0.5341000 -1.0109258 -1.5900281 0.2742328 0.29039928
[9,] -1.49504465 -0.5419533 0.5766574 -1.2412777 -1.4089572 -0.71069839
[10,] -0.35935440 -0.2622265 0.4048126 -2.0869817 0.2682486 0.16904559
[,7] [,8] [,9] [,10]
[1,] 0.58027159 -1.0669137 0.48670802 0.5824858
[2,] -0.48314440 -1.0532693 -0.34030385 -1.1041681
[3,] -2.81996446 0.3191438 -0.48117020 -0.0352633
[4,] -0.55080515 -1.0620243 -0.51383557 0.3161907
[5,] -0.56808769 -0.3696951 0.35549191 0.3202675
[6,] -0.25043479 -1.0389393 0.07810902 0.5251606
[7,] -0.41667318 0.8809541 -0.04613332 -1.1586756
[8,] -0.06898363 -1.0736161 0.62698065 -1.0373835
[9,] 0.30051583 -0.2936140 0.31417921 -1.4155193
[10,] -0.68180034 -1.0789745 0.58290920 -1.0197956
You can test for sufficient observations using complete.cases(x); the number of complete rows must be at least k.
One way to overcome this problem is to relax your requirements (i.e. end up with fewer incomplete rows), either by 1) raising the threshold above which values are set to NA, or 2) increasing your number of observations.
Here is the first:
> library(DMwR)   # provides knnImputation
> x = matrix(rnorm(100),10,10)
> x.missing = x > 2
> x[x.missing] = NA
> complete.cases(x)
[1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
> knnImputation(x,k=3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.86882569 -0.2409922 0.3859031 0.5818927 -1.50310330 0.8752261 -0.5173105 -2.18244988 -0.28817656 -0.63941237
[2,] 1.54114079 0.7227511 0.7856277 0.8512048 -1.32442954 -2.1668744 0.7017532 -0.40086348 -0.41251883 0.42924986
[3,] 0.60062917 -0.5955623 0.6192783 -0.3836310 0.06871570 1.7804657 0.5965411 -1.62625036 1.27706937 0.72860273
[4,] -0.07328279 -0.1738157 1.4965579 -1.1686115 -0.06954318 -1.0171604 -0.3283916 0.63493884 0.72039689 -0.20889111
[5,] 0.78747874 -0.8607320 0.4828322 0.6558960 -0.22064430 0.2001473 0.7725701 0.06155196 0.09011719 -1.01902968
[6,] 0.17988720 -0.8520000 -0.5911523 1.8100573 -0.56108621 0.0151522 -0.2484345 -0.80695513 -0.18532984 -1.75115335
[7,] 1.03943492 0.4880532 -2.7588922 -0.1336166 -1.28424057 1.2871333 0.7595750 -0.55615677 -1.67765572 -0.05440992
[8,] 1.12394474 1.4890366 -1.6034648 -1.4315445 -0.23052386 -0.3536677 -0.8694188 -0.53689507 -1.11510406 -1.39108817
[9,] -0.30393916 0.6216156 0.1559639 1.2297105 -0.29439390 1.8224512 -0.4457441 -0.32814665 0.55487894 -0.22602598
[10,] 1.18424722 -0.1816049 -2.2975095 -0.7537477 0.86647524 -0.8710603 0.3351710 -0.79632184 -0.56254688 -0.77449398
> x
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.86882569 -0.2409922 0.3859031 0.5818927 -1.5031033 0.8752261 -0.5173105 -2.18244988 -0.28817656 -0.63941237
[2,] 1.54114079 0.7227511 0.7856277 0.8512048 -1.3244295 -2.1668744 0.7017532 -0.40086348 -0.41251883 0.42924986
[3,] 0.60062917 -0.5955623 0.6192783 -0.3836310 0.0687157 1.7804657 0.5965411 -1.62625036 1.27706937 0.72860273
[4,] -0.07328279 -0.1738157 1.4965579 -1.1686115 NA -1.0171604 -0.3283916 0.63493884 0.72039689 -0.20889111
[5,] 0.78747874 -0.8607320 0.4828322 NA -0.2206443 0.2001473 0.7725701 0.06155196 0.09011719 -1.01902968
[6,] 0.17988720 -0.8520000 -0.5911523 1.8100573 -0.5610862 0.0151522 -0.2484345 -0.80695513 -0.18532984 -1.75115335
[7,] 1.03943492 0.4880532 -2.7588922 -0.1336166 -1.2842406 1.2871333 0.7595750 -0.55615677 -1.67765572 -0.05440992
[8,] 1.12394474 1.4890366 -1.6034648 -1.4315445 -0.2305239 -0.3536677 -0.8694188 -0.53689507 -1.11510406 -1.39108817
[9,] -0.30393916 0.6216156 0.1559639 1.2297105 -0.2943939 1.8224512 -0.4457441 -0.32814665 0.55487894 -0.22602598
[10,] 1.18424722 -0.1816049 -2.2975095 -0.7537477 0.8664752 -0.8710603 0.3351710 -0.79632184 -0.56254688 -0.77449398
Here is an example of the second:
> x = matrix(rnorm(1000),100,10)
> x.missing = x > 1
> x[x.missing] = NA
> complete.cases(x)
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
[22] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
Since at least k = 3 complete rows are available, it is able to impute with k = 3.
> head(knnImputation(x,k=3))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.01817557 -2.8141502 0.3929944 0.1495092 -1.7218396 0.4159133 -0.8438809 0.6599224 -0.02451113 -1.14541016
[2,] 0.51969964 -0.4976021 -0.1495392 -0.6448184 -0.6066386 -1.6210476 -0.3118440 0.2477855 -0.30986749 0.32424673
...
Best Answer
The idea is to be more robust against variations in the distances of the k nearest neighbors, which may otherwise lead to wrong decisions. It also leads to smoother decision surfaces.
The assumption is that the neighbors closest to the sample should be given more weight when voting on which class the sample belongs to, since they are more similar to it.
This is especially relevant for samples close to the decision surface, which are sensitive to effects like noise or sampling differences among the classes. It is not the only way to cope with such problems, though. For example, in this paper an alternative weighting for the neighbors is given which, besides being more effective for unbalanced data, has a nice probabilistic interpretation.
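As an illustrative sketch of the idea (not the weighting from the paper mentioned above), inverse-distance-weighted voting in R might look like this; all names here are made up:

idw.knn <- function(train, cl, query, k = 5, eps = 1e-8) {
  d <- sqrt(colSums((t(train) - query)^2))    # Euclidean distance to each training point
  nn <- order(d)[1:k]                         # the k nearest neighbors
  w <- 1 / (d[nn] + eps)                      # inverse-distance weights (eps avoids division by zero)
  names(which.max(tapply(w, cl[nn], sum)))    # class with the largest total weight
}

X <- as.matrix(iris[, 1:4])
y <- iris$Species
idw.knn(X, y, query = colMeans(X), k = 5)     # classify an example query point

Closer neighbors contribute larger weights, so a single distant neighbor from another class is less likely to flip the vote than it would be under unweighted majority voting.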