Solved – In the context of KNN, why does a small K generate complex models?

Tags: k-nearest-neighbour, machine-learning

Section 1.4.8 of "Machine Learning: A Probabilistic Perspective" by Kevin Patrick Murphy gives this figure (Figure 1.21(a)) to illustrate the error rate of a KNN classifier for different values of K:

[Figure 1.21(a): training and test error rate of a KNN classifier plotted against K]

It then gives the following explanation:

We see that increasing K increases our error rate on the training set, because we are over-smoothing. As we said above, we can get minimal error on the training set by using K = 1, since this model is just memorizing the data.

However, what we care about is generalization error, which is the expected value of the misclassification rate when averaged over future data (see Section 6.3 for details). This can be approximated by computing the misclassification rate on a large independent test set, not used during model training. We plot the test error vs K in Figure 1.21(a) in solid red (upper curve). Now we see a U-shaped curve: for complex models (small K), the method overfits, and for simple models (big K), the method underfits.
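The described behaviour is easy to reproduce. Below is a minimal sketch using a synthetic two-class dataset (scikit-learn's make_moons; the dataset, split, and range of K are my own assumptions, not the setup behind Figure 1.21(a)): training error is smallest at K = 1 and grows with K, while test error follows the U-shaped curve.

```python
# Minimal sketch of the train/test error curves, assuming scikit-learn and a
# synthetic make_moons dataset (not the data behind Figure 1.21(a)).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for k in (1, 3, 5, 10, 20, 50, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)   # 0 at K = 1: pure memorisation
    test_err = 1 - model.score(X_test, y_test)      # typically U-shaped in K
    print(f"K={k:3d}  train error={train_err:.3f}  test error={test_err:.3f}")
```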

Why does a small K generate a more complex model?

Best Answer

In essence, kNN is a density-based classifier: you pick a point, expand a "window" around it, and predict the most frequent class within the window. It differs from Parzen-window (kernel density estimation) classifiers in one respect: the window size is variable - it grows until it contains exactly $k$ observations.
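As a concrete sketch of that description (my own toy implementation, not code from the book): for each query point, the "window" is simply whatever radius it takes to cover the $k$ nearest training points, and the prediction is a majority vote inside it.

```python
# Toy kNN classifier illustrating the "variable-width window" idea; the data
# and function name are illustrative only.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    # Distance from the query to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # The "window" grows exactly far enough to contain the k closest points.
    inside_window = np.argsort(dists)[:k]
    # Majority vote over the labels that fall inside the window.
    return Counter(y_train[inside_window]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```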

With this picture in mind, think about how different values of $k$ influence the density estimate over your feature space. Small $k$ gives narrower "windows" - the estimate has a lower bandwidth and can follow abrupt local changes in the data. Larger $k$ averages over bigger regions and smooths those changes away. Because a small-$k$ model is more sensitive to local fluctuations, it is the more complex model.

It might also be useful to think about the other extreme, $k = \infty$ (in practice, $k = n$). Here the density is estimated over the whole feature space and you get the simplest classifier possible: for every new sample, predict the most frequent class in the training data.
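In code this extreme reduces to a majority-class baseline (sketched below with made-up labels):

```python
# With k = n the "window" is the whole training set, so every query is assigned
# the globally most frequent class. The labels here are hypothetical.
from collections import Counter

y_train = [0, 0, 1, 1, 1]
majority_class = Counter(y_train).most_common(1)[0][0]
print(majority_class)  # 1 - the prediction for every possible query
```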


This is best illustrated with a regression example on a one-dimensional feature space. Say you have a model that predicts a continuous value $y$ from a continuous feature $x$, and your training set consists of 10 points. The data can be visualised like this:

[Figure: the 10 training points plotted as $y$ against $x$]

Then fitted models with $k = \{1,2,3,4\}$ would look like this:

[Figure: kNN regression fits for $k = 1, 2, 3, 4$, each a piecewise-constant step function]

Note one thing in the picture above: the number of distinct values (steps) the model can return decreases by 1 with every unit increase of $k$. In one dimension the $k$ nearest neighbours are always $k$ consecutive training points, so with $n$ points there are only $n - k + 1$ possible neighbourhoods, hence $n - k + 1$ steps.

And of course the model for $k = \infty$ would produce:

[Figure: a constant prediction equal to the mean of all training targets]
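A short sketch of this regression example (my own made-up data, using scikit-learn's KNeighborsRegressor) confirms both observations: the fitted curve has $n - k + 1$ distinct steps, and at $k = n$ the prediction collapses to the overall mean of the training targets.

```python
# Sketch of the 1-D regression example: count the distinct values (steps) a kNN
# regressor can output for k = 1..4 and k = 10 on ten made-up training points.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=10)).reshape(-1, 1)   # 10 training inputs
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=10)       # noisy continuous targets

x_grid = np.linspace(0, 10, 500).reshape(-1, 1)           # dense grid of query points
for k in (1, 2, 3, 4, 10):
    preds = KNeighborsRegressor(n_neighbors=k).fit(x, y).predict(x_grid)
    n_steps = len(np.unique(np.round(preds, 6)))
    print(f"k={k:2d}  distinct predicted values: {n_steps}")  # generically n - k + 1
```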