Solved – K nearest neighbors with nested cross validation

cross-validation, k nearest neighbour, model selection, model-evaluation

I'm working on a binary classification problem on this dataset, using the k-NN algorithm.

For performance evaluation and parameter tuning (i.e. choosing k) I'm using nested cross-validation.

I split my dataset into 5 equal-sized folds, and then for every outer fold I ran an inner cross-validation on its training set (splitting that training set into 5 equal-sized folds) to tune k. I used a fixed set of candidate values for k (1, 3, 5, 7, 9, 11, 13, etc.).

For every inner cross-validation I took the best k and used it to evaluate the corresponding outer fold.
I've drawn an example schema as an explanation:

[Schema: nested cross-validation]
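As a rough sketch, this setup could look like the following in Python with scikit-learn. The dataset here is a synthetic placeholder, and the fold counts and k grid simply mirror the description above; it is an illustration of the procedure, not the exact code I used.

```python
# Minimal sketch of the nested cross-validation described above (placeholder data).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder binary dataset

param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13]}        # candidate k values
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # inner split: k tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)   # outer split: performance estimate

# Inner loop: GridSearchCV picks the best k on each outer training set.
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid,
                          cv=inner_cv, scoring="accuracy")

# Outer loop: each outer fold is scored with the k chosen by the inner loop.
outer_scores = cross_val_score(knn_search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores, outer_scores.mean())
```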

I got, for example, these results (5-fold cross-validation):

  • First fold, best k = 11, accuracy = 0.785
  • Second fold, best k = 11, accuracy = 0.776
  • Third fold, best k = 11, accuracy = 0.786
  • Fourth fold, best k = 11, accuracy = 0.791
  • Fifth fold, best k = 9, accuracy = 0.793

This gives an overall performance of 0.7853669 (mean of the fold accuracies).

Now, since I don't get the same best k for every fold (the selection is done with the inner cross-validation), which k should I use for my final model (the one I will use for real classification)?

  • Does it make sense to use the mean of the best k values?

  • Or should I run an inner-style cross-validation on the whole dataset for the final k selection, and state that the expected performance is the one estimated with the nested cross-validation?

Best Answer

I found the answer to my question:

Should I run a cross-validation on the whole dataset for the final k selection, and state that the expected performance is the one estimated with the nested cross-validation?

  • Short answer: yes.

  • Long answer: nested cross-validation is needed to evaluate the whole procedure of learning and hyperparameter tuning. That means that, in the end, if I want to select a k for my final model, I apply the same inner cross-validation procedure (the one used on each outer training set) to the entire dataset, as sketched below. The expected performance of this final model is the one estimated earlier with the nested cross-validation.
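Concretely, that final step could be sketched like this (again with scikit-learn and placeholder data; the grid and fold count are assumptions mirroring the setup above):

```python
# Sketch of the final step: tune k on the whole dataset with the same inner
# cross-validation, then fit the model used for real classification.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder binary dataset
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 13]}        # same candidate k values

final_search = GridSearchCV(KNeighborsClassifier(), param_grid,
                            cv=KFold(n_splits=5, shuffle=True, random_state=1),
                            scoring="accuracy")
final_search.fit(X, y)                       # k selection on the full dataset
final_model = final_search.best_estimator_   # the model to deploy
print("selected k:", final_search.best_params_["n_neighbors"])
```

Note that the performance of this final model is not re-estimated on the same data: the nested cross-validation mean reported above (0.7853669) is the estimate of its expected accuracy.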