What exactly does nested K-Fold Cross-validation mean in terms of kNN?

Tags: cross-validation, k-nearest-neighbour

How do I implement nested K-fold cross-validation when it comes to k-Nearest Neighbours?

Let's say I built a kNN classifier and used K-Fold CV to tune the hyper-parameter. Now, how do I use nested K-Fold CV? I have read multiple articles, but they don't explain it well enough (especially in the case of kNN).

From my understanding, in nested CV:

I do K-Fold CV with K = 5 and k = 1, for example, on the training data and see the mean error rate. Then I do CV again with K = 10, for example, and k = 1, and then I do it again, with K = 15, for example, and k = 1, and so on, for multiple values of K.

Then I repeat the whole thing for k = 2, and so on, for multiple values of k.

In the end, I can use the data to plot a graph showing what the mean error rate is for multiple values of k and multiple values of K. So the x-axis = k values, the y-axis = mean error rate, and I can plot one line per value of K.

And so I can look for the value of k with the minimum mean error rate, for the largest K I tried, and use that value in the classifier to test out-of-sample accuracy on the test set.

Is that what is meant by nested CV?

Best Answer

That is not quite what is meant by nested CV.

Suppose your basic learning algorithm is "use 20-fold CV to find the best value of $k$, for $k = 1, 2, 3$". In order to assess the performance of this algorithm, you could again use CV, say 10-fold this time. As cbeleites commented, let's term these 10 folds the "outer folds":

  • In outer fold 1, you would leave out the 1st 10th of the data; on the remainder of the data, you would perform 20-fold CV for each of $k = 1, 2, 3$, and note the best $k$ you found. For this $k$, you would train on all of the data except for the 1st 10th, and check the performance on the 1st 10th.

  • In outer fold 2, you would leave out the 2nd 10th of the data; on the remainder of the data, you would perform 20-fold CV for each of $k = 1, 2, 3$, and note the best $k$ you found. For this $k$, you would train on all of the data except for the 2nd 10th, and check the performance on the 2nd 10th.

  • ...

  • In outer fold 10, you would leave out the last 10th of the data; on the remainder of the data, you would perform 20-fold CV for each of $k = 1, 2, 3$, and note the best $k$ you found. For this $k$, you would train on all of the data except for the last 10th, and check the performance on the last 10th.

So the 10 outer folds together give you 10 CV estimates; the 20 inner folds within each outer fold are used only for selecting $k$.
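
For concreteness, here is a minimal sketch of this procedure in Python with scikit-learn, assuming the setup above (candidates $k = 1, 2, 3$, 20 inner folds, 10 outer folds); the synthetic dataset and the variable names are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Placeholder data; substitute your own X and y.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Inner loop: on each outer training set, 20-fold CV picks the best k,
# and (with refit=True, the default) retrains on that whole training set.
inner_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 2, 3]},
    cv=20,
)

# Outer loop: 10 folds, each yielding one performance estimate of the
# tuned classifier on data it never saw during tuning.
outer_scores = cross_val_score(inner_search, X, y, cv=10)

print(outer_scores)         # 10 estimates, one per outer fold
print(outer_scores.mean())  # overall nested-CV performance estimate
```

Passing the GridSearchCV object to cross_val_score is what makes the CV nested: each outer fold triggers its own inner 20-fold search, so the choice of $k$ never sees the corresponding outer test fold.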