Solved – How to define the maximum k of the kNN classifier

classification, cross-validation, k nearest neighbour, pattern recognition

I am trying to use a kNN classifier to perform some supervised learning. To find the best value of k for kNN, I used cross-validation. For example, the following code loads one of MATLAB's standard datasets and runs cross-validation, plotting various k values against the cross-validation error:

load ionosphere;
[N,D] = size(X)
resp = unique(Y)

rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k=1:numel(K)
    knn = ClassificationKNN.fit(X,Y,... % fitcknn in newer MATLAB releases
        'NumNeighbors',K(k),'CrossVal','On');
    cvloss(k) = kfoldLoss(knn);
end
figure; % Plot the error versus k
plot(K,cvloss);
xlabel('Number of nearest neighbors');
ylabel('10 fold classification error');
title('k-NN classification');

The result looks like

[Figure: 10-fold cross-validation error versus number of nearest neighbours k]

The best k in this case is k=2 (the search is not exhaustive). From the figure, we can see that the cross-validation error rises dramatically for k>50, and settles at a large, stable value for k>100.

My question is: what is the maximum k we should test in this kind of cross-validation framework?

For example, there are two classes in the 'ionosphere' data, one labeled 'g' and the other 'b'. There are 351 instances in total: 225 cases of 'g' and 126 cases of 'b'.

The code above tests values up to k=351. But should we only test k from 1 to 126, or up to 225? Is there a relation between the number of cases in each class and the maximum k worth testing? Thanks. A.

Best Answer

Consider the extreme case of a dataset containing $N$ positive patterns and one negative pattern. If $k$ is three or more, we will always classify every pattern as positive, so the cross-validation accuracy will be the same for all $k \geq 3$. Three is a pretty small number, so the other extreme case must be the interesting one.

Next consider the other extreme case, where we have $N$ positive patterns and $N$ negative patterns. If we set $k=2N-1$, then whether we classify a test pattern as positive or negative depends on the distribution of the data: the $2N-1$ nearest neighbours must be either $N$ positive patterns and $N-1$ negative ones, or $N-1$ positive patterns and $N$ negative ones. If $k$ is any bigger, every vote is a tie, as there are only $N$ positive and $N$ negative patterns in the training set.

This suggests that to be sure we have looked at every non-redundant value of $k$, we may have to search up to the total number of training samples. If we have $N$ positive patterns and $M < N$ negative patterns, then I suspect you need only search as high as $k = 2M+1$: any $k \geq 2M+1$ is guaranteed to include more positive than negative patterns among the neighbours, so all larger values of $k$ give the same classifier.
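To illustrate the $k = 2M+1$ bound numerically, here is a minimal sketch (in Python with numpy rather than MATLAB; the toy data, the `knn_predict` helper, and the class sizes are all made up for the illustration). With $M$ negative patterns among the training data, at most $M$ of the $k$ neighbours can vote negative, so for every $k \geq 2M+1$ the positive class always wins:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Plain majority-vote k-NN for one query point (illustrative sketch).
    Labels are +1 / -1; returns the sign of the vote (0 means a tie)."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all training points
    nearest = np.argsort(d)[:k]               # indices of the k nearest
    return np.sign(y_train[nearest].sum())

rng = np.random.default_rng(0)
N, M = 20, 5                                  # 20 positive, 5 negative patterns
X = rng.normal(size=(N + M, 2))
y = np.array([+1] * N + [-1] * M)

# For any k >= 2M+1, at most M of the k neighbours are negative, so at
# least M+1 are positive and the vote is positive whatever the query is.
queries = rng.normal(size=(50, 2))
for k in range(2 * M + 1, N + M + 1):
    assert all(knn_predict(X, y, q, k) == +1 for q in queries)
print("every k >=", 2 * M + 1, "predicts the positive class for all queries")
```

All values $k \geq 2M+1$ therefore collapse to the same (majority-class) classifier, which is why searching beyond that point is redundant.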

I hope my meanderings on this are correct; this is just my intuition!

As an aside, if you use cross-validation to search for the best $k$ it can be done much more efficiently for small datasets if you rearrange the loops. Instead of the outer loop looping over values of $k$, make the outer loop loop over cross-validation folds instead. Then in each fold construct a matrix of pairwise distances between all training and all testing points. Sort them into ascending order. You can then evaluate the error for each value of $k$ by looking increasingly further down the list of sorted distances. This means you only need to compute the matrix of distances and sort it once, rather than once for each value of $k$. I often tune $k$ by minimising the leave-one-out error, which can be done very cheaply using this method.
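The loop-rearrangement trick above can be sketched as follows, here for the leave-one-out case (a Python/numpy sketch under my own assumptions; the function name `loo_knn_errors`, the toy two-cluster data, and the tie-counts-as-error convention are all illustrative, not from the original post). The pairwise distances are computed and sorted once, and the error for every $k$ is then read off by walking down the sorted neighbour lists:

```python
import numpy as np

def loo_knn_errors(X, y, max_k):
    """Leave-one-out k-NN error for k = 1..max_k using a single
    distance computation and a single sort (illustrative sketch)."""
    n = len(X)
    # Full pairwise distance matrix, computed once.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)        # a point is not its own neighbour
    order = np.argsort(D, axis=1)      # neighbours sorted once per point
    errors = np.zeros(max_k)
    for i in range(n):
        votes = 0                      # running vote over the first k neighbours
        for k in range(max_k):
            votes += 1 if y[order[i, k]] == y[i] else -1
            if votes <= 0:             # ties are counted as errors here
                errors[k] += 1
    return errors / n                  # errors[k-1] is the LOO error for k

# Toy example: two Gaussian clusters (made up for the demonstration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
err = loo_knn_errors(X, y, max_k=15)
best_k = int(np.argmin(err)) + 1
```

The cost is one $O(n^2)$ distance computation and one sort per point, instead of repeating both for every candidate $k$, which is exactly why tuning $k$ by leave-one-out becomes cheap with this arrangement.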