Solved – Is it pointless to use Bagging with nearest neighbor classifiers

baggingk nearest neighbour

In page 485 of the book [1], it is noted that "it is pointless to bag nearest-neighbor classifiers because their output changes very little if the training data is perturbed by sampling". This is strange to me because I think the KNN method has high variance when $K$ is small (such as for nearest neighbor method where $K$ is equal to one), and that makes it perfect for bagging. What is wrong with this intuition?

[1] Witten, Ian H., et al. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.

Best Answer

In the original paper about bagging, Breiman refers to this point. He explains that unstable learners are likely to give different prediction for modified datasets and likely to benefit from bagging. On the other hand, stable learners (take to the extreme a constant), will give quite similar predictions anyway so bagging won't help.

He also refer to specific algorithms stability:

Unstability was studied in Breiman [1994] where it was pointed out that neural nets, classi cation and regression trees, and subset selection in linear regression were unstable,while k-nearest neighbor methods were stable.

Breiman [1994] is "Breiman,L.(1994)Heuristics of instability in model selection,Technical Report, Statistics Department, University of California at Berkeley."

I think the Breiman extended the technical report into Heuristics of Instability and Stabilization in Model Selection but he hardly refers to knn there.

I your intuition is correct. The lower k the more unstable the model will be. The more we modify the dataset, the higher probability the use a different set of neighbors. If you take k=1 and modify the dataset enough so the probability of getting the same neighbor is less then 80%, bagging should help. I think the the use case Breiman had in mind is a higher k and more delicate modifications. If you have k of 10 and probability of having the same neighbour of 99%, the results will be quite stable.