My question is about the 1-nearest neighbor classifier, and concerns a statement made in the excellent book The Elements of Statistical Learning, by Hastie, Tibshirani and Friedman. The statement is (p. 465, section 13.3):
"Because it uses only the training point closest to the query point, the bias of the 1-nearest neighbor estimate is often low, but the variance is high."
The book is available at
http://www-stat.stanford.edu/~tibs/ElemStatLearn/download.html
For starters, we can define what bias and variance are. From the question "how-can-increasing-the-dimension-increase-the-variance-without-increasing-the-bi", we have:
"First of all, the bias of a classifier is the discrepancy between its averaged estimated and true function, whereas the variance of a classifier is the expected divergence of the estimated prediction function from its average value (i.e. how dependent the classifier is on the random sampling made in the training set).
Hence, the presence of bias indicates something basically wrong with the model, whereas variance is also bad, but a model with high variance could at least predict well on average."
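To restate those definitions a bit more formally (this notation is mine, not the book's): for a fixed query point $x_0$, with $f$ the true function and $\hat f$ the estimate fitted to a randomly drawn training set,
$$\operatorname{Bias}(x_0) = \mathbb{E}\big[\hat f(x_0)\big] - f(x_0), \qquad \operatorname{Var}(x_0) = \mathbb{E}\Big[\big(\hat f(x_0) - \mathbb{E}[\hat f(x_0)]\big)^2\Big],$$
where the expectations are taken over the random draw of the training set.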
Could someone please explain why the variance is high and the bias is low for the 1-nearest neighbor classifier?
Best Answer
The bias is low because you fit your model only to the single nearest point. This means your model will be really close to your training data.
The variance is high because using only the single nearest point means the probability that you model the noise in your data is really high. Following your definition above, your model depends heavily on the particular subset of data points chosen as training data. If you randomly redraw the training points, the model will be dramatically different in each iteration. So the variance will be high, because each time your model will be different.
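A minimal sketch of that resampling argument (the data-generating process and parameters here are invented for illustration, and I use the regression version of k-NN so the spread of predictions is easy to read off): refit 1-NN and 10-NN on many fresh training samples and measure how much the prediction at a fixed query point varies.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x_query = np.array([[0.5]])  # fixed query point

def prediction_variance(k, n_repeats=200, n_train=50):
    preds = []
    for _ in range(n_repeats):
        # fresh noisy training sample each iteration
        X = rng.uniform(0, 1, size=(n_train, 1))
        y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=n_train)
        model = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        preds.append(model.predict(x_query)[0])
    # variance of the fitted prediction across training sets
    return np.var(preds)

print("variance of 1-NN prediction :", prediction_variance(k=1))
print("variance of 10-NN prediction:", prediction_variance(k=10))
```

The 1-NN prediction at the query point typically fluctuates much more from one training sample to the next than the 10-NN prediction, which is exactly the high variance described above.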
Example

In general, a k-NN model predicts at a specific point using the k nearest data points in your training set; for 1-NN the prediction depends on a single other point. E.g., you want to split your samples into two groups (classification), red and blue. For a certain query point p, suppose the nearest 4 neighbors are red, blue, blue, blue (in ascending order of distance to p). A 4-NN model would classify the point as blue (3 votes blue against 1 red), but a 1-NN model classifies it as red, because red is the nearest point. This means your model is really close to your training data and therefore the bias is low: if you compute the RSS between your model and your training data, it is close to 0. In contrast, the variance of your model is high, because your model is extremely sensitive and wiggly. As pointed out above, randomly redrawing your training set is likely to change your model dramatically. A 10-NN model would be more robust in such cases, but could be too stiff. Which k to choose depends on your data set, and that choice is exactly the Bias-Variance-Tradeoff this question is about.
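Here is a minimal sketch of that red/blue example in scikit-learn. The 1-D coordinates are invented so that the four nearest neighbors of the query point are red, blue, blue, blue in order of increasing distance:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# four training points at distances 1, 2, 3, 4 from the query point
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array(["red", "blue", "blue", "blue"])  # nearest neighbor is red
query = np.array([[0.0]])

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
knn_4 = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)

print(knn_1.predict(query))  # ['red']  -- follows the single nearest point
print(knn_4.predict(query))  # ['blue'] -- majority vote of the 4 nearest points
```

The 1-NN prediction flips entirely with the identity of the single closest training point, while the 4-NN vote is buffered by the other neighbors.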