k-Nearest Neighbour – Why Does Error Rate of kNN Increase When k Approaches Size of Training Set?

Tags: classification, k-nearest-neighbour, train-test-split

I've been experimenting with the effect that different values of k have on the generalisation error of a kNN classifier, and I got some unexpected results towards the end, when k approaches the size of the training set.

The result that I get is the following:

[Plot: average generalisation error of the kNN classifier vs. k]

How the experiment was run:

The dataset is based on sklearn's `make_circles` method, set up with the following parameters: `n_samples=500`, `factor=0.5` and `noise=0.3`.

The generalisation error was calculated as follows:

For each k in np.linspace(1, train_size - 1, 100):
   {
      generate data
      `train_test_split` with `test_size=0.2`
      fit the kNN model with n_neighbors=k
      predict on the test set
      calculate the error
   } repeat 100 times and take the average error
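The loop above can be sketched as follows; this is my reconstruction using the sklearn calls named in the question, with the number of k values and repetitions reduced in the demo loop to keep it quick (the actual experiment used 100 values of k and 100 repetitions each):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def avg_test_error(k, n_reps=100):
    """Average test-set error of a kNN classifier over n_reps fresh datasets/splits."""
    errors = []
    for _ in range(n_reps):
        # Fresh dataset each repetition, with the parameters from the question
        X, y = make_circles(n_samples=500, factor=0.5, noise=0.3)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        errors.append(1.0 - clf.score(X_te, y_te))  # error = 1 - accuracy
    return float(np.mean(errors))

# Demo: a handful of k values instead of the full np.linspace(1, 399, 100) sweep
for k in (1, 50, 200, 399):
    print(k, avg_test_error(k, n_reps=10))
```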

My interpretation:

For k up to 150 I'm happy with the results. For low values of k the total error is dominated by variance; for higher values of k it is dominated by bias. So we get the classic U-shaped plot.

As k gets larger, the error rate converges to 50%. This also makes sense, since as you increase the value of k, you start considering a large enough portion of your dataset such that your predictor will just give you the class that appears more frequently in the training set. Since the classes are split equally, and it's a binary classification problem, we expect our model to get the right answer 50% of the time.

What I don't understand is why there seems to be an increase in the error rate for values of k > 370. I would have expected it to stay as is, getting the right answer 50% of the time. Does anyone know why this would be the case?

I've tried running the experiment multiple times, and have also looked at the range from 370 to 399 more closely. Here is a plot:
[Plot: zoomed view of the error rate for k from 370 to 399]

Best Answer

Adding my guess as an answer.

When splitting the data into train/test sets you do a 0.8/0.2 split, which, with a sample size of 500, gives 400/100. You then train kNN on the 400 samples and test the accuracy on the remaining 100.

If the split is random, then the class balance will not be exactly equal. Hence the classifier will learn to predict the class that is more frequent in the training set, but, due to the split, that same class will be less frequent in the test set. As a result, once the classifier starts always returning the more frequent class, your error will be less than 50% on the training data but more than 50% on the test data.

Your error (close to 55%) is also consistent with this interpretation. Here is a quick R simulation (run 10,000 times) to check the average error we expect on the test set with such a strategy:

mean(replicate(10000, max(table(sample(rep(letters[1:2], each=250), 100)))))
[1] 53.5586

NOTE: for the simulation I didn't do any classification, but simply simulated drawing 100 random samples out of an equally balanced set of 500 samples and checked the frequency of the larger class (the class your method would misclassify at k = 400).
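The same check can be sketched in Python, as a rough equivalent of the R one-liner above; the key observation is that the test set's majority-class count equals the test error (in %) of a classifier that always predicts the training set's majority class:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0] * 250 + [1] * 250)  # perfectly balanced 500-sample set

def test_majority_count():
    # Draw a random 100-point test set without replacement. Its majority class
    # is necessarily the training set's minority class, so this count is the
    # number of test points such a degenerate classifier gets wrong.
    test = rng.choice(labels, size=100, replace=False)
    return int(max(np.bincount(test, minlength=2)))

avg_error = float(np.mean([test_majority_count() for _ in range(10000)]))
print(avg_error)  # ≈ 53.5, in line with the R result
```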

As for why this starts to happen at around k = 385: my guess is that this is the breaking point at which the classifier starts to simply predict the most frequent class. In other words, within your training data there are about 15 samples' worth of imbalance between the classes. Up to that point the classifier fluctuates between one class and the other, but beyond it every additional neighbour will be of the majority class.
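One way to sanity-check the k ≈ 385 figure: the majority class is guaranteed to win the vote as soon as k exceeds twice the minority-class count in the training set, since at most all of the minority points can appear among the k neighbours. Plugging in a hypothetical minority count of 192 (roughly the 15-sample imbalance guessed above) recovers the observed threshold:

```python
n_train = 400
n_minority = 192  # hypothetical minority-class count; imbalance = 400 - 2*192 = 16

# At most n_minority of any k neighbours can belong to the minority class, so
# the majority class wins every vote once k - n_minority > n_minority,
# i.e. k >= 2 * n_minority + 1.
k_threshold = 2 * n_minority + 1
print(k_threshold)  # 385
```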
