Solved – How to interpret the number of neighbours k in a k-nearest-neighbour classifier

classification, k nearest neighbour, machine learning, pattern recognition

I have done some classification work using a k-nearest-neighbour (kNN) classifier, and the classification performance is evaluated using cross-validation. Some test code from the Matlab Help is:

load ionosphere;
[N,D] = size(X)
resp = unique(Y)

rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k=1:numel(K)
    knn = ClassificationKNN.fit(X,Y,...
        'NumNeighbors',K(k),'CrossVal','On');
    cvloss(k) = kfoldLoss(knn);
end

figure; % Plot the cross-validation error versus k
semilogx(K,cvloss);
xlabel('Number of nearest neighbors');
ylabel('10 fold classification error');
title('k-NN classification');

We can tune the number of neighbours k in kNN and plot it against the cross-validation error. In this case, the plot looks like the one below.

[Figure: k vs. cross-validation error]

We can see that the plot shows a slightly larger error for a very small value such as k=1. The error decreases as we increase k, reaching its lowest value at k=2, then grows again and levels off at some stable value once k>100.
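If you want to read the best k off the curve programmatically rather than from the plot, a minimal sketch (reusing the K and cvloss variables from the loop above) is:

[minLoss,idx] = min(cvloss);   % smallest cross-validated error and its index
bestK = K(idx);                % corresponding number of neighbors
fprintf('Best k = %d (10-fold error = %.4f)\n', bestK, minLoss);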

My question is: how should I interpret these results?

  1. Why do we get a slightly larger error when k is very small, e.g. k=1?
  2. Why do we obtain the best performance when k is small?
  3. Why do we get a very large error when k is very large?

Please provide some references if there is mathematical background behind this. Thanks very much. A.

Best Answer

It all comes down to distance in the feature space into which you're projecting your data. The $k$ in $k$NN tells the algorithm how many nearest neighbors, in terms of distance in your feature space, should be used to determine the class of an unknown data point, often by majority vote. Thinking of it in this way makes your observations more intuitive, I think: for your dataset, taking into account the class labels of the two most similar, or nearest, neighbors of a data point seems to be a good way to go. Taking only one into account is likely too small a sample to overcome the variability of feature values within each class. Similarly, as you increase the number of data points contributing to the label of your new data point, you take into account points that are less and less similar to the point in question.

This is why the experiment you ran is so important for $k$NN experiments! It is good practice to run cross-validation on a hold-out optimization data set over multiple $k$-values, to estimate the best choice for your data. Data sets in different domains, or even problems using data sets in the same domain but with different feature representations, are going to differ.
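To make the majority-vote mechanism concrete, here is a minimal hand-rolled sketch in Matlab; the toy data points and labels are made up for illustration and are unrelated to the ionosphere set above:

% Toy training data (hypothetical, for illustration only)
Xtrain = [1 1; 1 2; 5 5; 6 5; 6 6];   % points in a 2-D feature space
Ytrain = {'a'; 'a'; 'b'; 'b'; 'b'};   % their class labels
xnew   = [5 6];                       % unknown point to classify
k      = 3;                           % number of nearest neighbors to consult

% Euclidean distance from the unknown point to every training point
d = sqrt(sum(bsxfun(@minus, Xtrain, xnew).^2, 2));
[~,order] = sort(d);                  % sort neighbors by distance, nearest first
nearestLabels = Ytrain(order(1:k));   % labels of the k nearest neighbors
predicted = mode(categorical(nearestLabels))  % majority vote decides the class

The built-in classifier (ClassificationKNN.fit, or fitcknn in newer releases) performs the same neighbor search and voting for you, with the distance metric and tie-breaking behaviour configurable.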