What exactly does nested K-Fold Cross-validation mean in terms of kNN?

Tags: cross-validation, k-nearest-neighbour

How do I implement nested K-fold cross-validation when it comes to k-Nearest Neighbours?

Let's say I built a kNN classifier and used K-Fold CV to tune the hyper-parameter. Now, how do I use nested K-Fold CV? I have read multiple articles, but they don't explain it well enough (especially in the case of kNN).

From my understanding, in nested CV:

I do K-Fold CV with K = 5 and k = 1, for example, on the training data and see the mean error rate. Then I do CV again with K = 10, for example, and k = 1, and then I do it again, with K = 15, for example, and k = 1, and so on, for multiple values of K.

Then I repeat the whole thing for k = 2, and so on, for multiple values of k.

In the end, I can use the data to plot a graph showing what the mean error rate is for multiple values of k and multiple values of K. So the x-axis = k values, the y-axis = mean error rate, and I can plot one line per value of K.

And so I can look for the value of k with the minimum mean error rate, for the largest K I tried, and use that value in the classifier to test out-of-sample accuracy on the test set.

Is that what is meant by nested CV?

Best Answer

That is not quite what is meant by nested CV.

Suppose your basic learning algorithm is "use 20-fold CV to find the best value of $k$, for $k = 1, 2, 3$". In order to assess the performance of this algorithm, you could again use CV, say 10-fold this time. As cbeleites commented, let's term these 10 folds the "outer folds":

  • In outer fold 1, you would leave out the 1st 10th of the data; on the remainder of the data, you would perform 20-fold CV for each of $k = 1, 2, 3$, and note the best $k$ you found. For this $k$, you would train on all of the data except for the 1st 10th, and check the performance on the 1st 10th.

  • In outer fold 2, you would leave out the 2nd 10th of the data; on the remainder of the data, you would perform 20-fold CV for each of $k = 1, 2, 3$, and note the best $k$ you found. For this $k$, you would train on all of the data except for the 2nd 10th, and check the performance on the 2nd 10th.

  • ...

  • In outer fold 10, you would leave out the last 10th of the data; on the remainder of the data, you would perform 20-fold CV for each of $k = 1, 2, 3$, and note the best $k$ you found. For this $k$, you would train on all of the data except for the last 10th, and check the performance on the last 10th.

So the 10 outer folds together give you 10 CV estimates; the 20 inner folds within each outer fold are used only for selecting $k$.
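
For concreteness, here is a minimal sketch of this procedure in Python with scikit-learn, assuming the setup above (candidates $k = 1, 2, 3$, 20 inner folds, 10 outer folds); the synthetic dataset and the variable names are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Placeholder data; substitute your own X and y.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Inner loop: on each outer training set, 20-fold CV picks the best k,
# and (with refit=True, the default) retrains on that whole training set.
inner_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 2, 3]},
    cv=20,
)

# Outer loop: 10 folds, each yielding one performance estimate of the
# tuned classifier on data it never saw during tuning.
outer_scores = cross_val_score(inner_search, X, y, cv=10)

print(outer_scores)         # 10 estimates, one per outer fold
print(outer_scores.mean())  # overall nested-CV performance estimate
```

Passing the GridSearchCV object to cross_val_score is what makes the CV nested: each outer fold triggers its own inner 20-fold search, so the choice of $k$ never sees the corresponding outer test fold.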