Let's say I have a classification problem with $c$ classes. For this, I have a data set containing $N$ distinct feature vectors with $n$ features. Let's say $N$ is of the order of $10^5$, and both $c$ and $n$ are of the order of $10$, so that there is enough training data to make statistically reasonable statements.

I now have three different classifiers (say a RandomForest, a NeuralNetwork and an SVM) that I want to train on the data set, and I then want an estimate of how well each classifier performs and how well it generalizes.
Each classifier has hyperparameters (e.g. tree depth for the RandomForest, number of layers in the NeuralNetwork, the C value for the SVM, etc.).

What is the best way to decide which classifier performs best? In other words, how can I say "The best RandomForest has a tree depth of $x$ and performs $p$% better than the best NeuralNetwork (which has $h$ hidden layers)"?

My approach would be the following:

  1. For each of the three classifiers, define a grid of the hyperparameters that should be analyzed.
  2. For each of the three classifiers, do an individual nested cross validation: According to this question on this site, the inner loop of the nested CV selects the best set of hyperparameters from the previously defined parameter grid. The outer loop then tells me how stable this choice of hyperparameters is. If the standard deviation of the scores of the $k$ resulting outer models is small, then I know that the choice of hyperparameters is stable and not strongly dependent on the subset of the data I used for training.
  3. The process in step 2 allows me to identify the best hyperparameters for each of the three classifiers. I fix these hyperparameters. Let's assume my model is stable and the hyperparameters do not vary strongly between the folds.
  4. I use the results of the outer cross validation from step 2 to get an unbiased estimate of how well each of the three classifiers performs. The one with the highest score in the nested CV is the one that presumably performs best on new, unseen data.
  5. If I wanted to use one of the three classifiers for further classification of unseen data, I would select the one with the highest nested CV score, as mentioned in step 4, and retrain this classifier with all the data I have.
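
To make steps 2 to 4 concrete, here is a minimal sketch of a nested CV for a single classifier using scikit-learn. The synthetic data, the grid values, the fold counts and `n_estimators=30` are all assumptions chosen so that it runs quickly; they are not taken from my actual problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Synthetic stand-in for the real data (assumption: much smaller than N ~ 1e5).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

param_grid = {"max_depth": [2, 4, 8]}  # illustrative grid only

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: the grid search picks hyperparameters on each outer training split.
search = GridSearchCV(RandomForestClassifier(n_estimators=30, random_state=0),
                      param_grid, cv=inner_cv)

# Outer loop: scores the whole "tune, then fit" procedure on held-out folds.
result = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

outer_scores = result["test_score"]
chosen = [est.best_params_ for est in result["estimator"]]

print("outer scores:", outer_scores)
print("mean:", outer_scores.mean(), "std:", outer_scores.std())
print("hyperparameters chosen per outer fold:", chosen)
```

The list `chosen` shows directly whether the inner loop settles on the same hyperparameters in every outer fold, which is the stability I refer to in steps 2 and 3.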

Is this a valid approach? Can I use the results of the nested CV to get an estimate of how well the classifier performs on unseen data or do I have to make a new k-fold CV with the best set of hyperparameters and use the results of this as my estimate?

Additionally: Is it valid to perform the nested CV for the three classifiers independently – as presented here – or do I have to do the following:

Perform a nested CV in which the inner loop tunes not only the hyperparameters of one classifier, but has access to all three classifiers with their respective hyperparameter grids. In this approach I would not know how well the "best" of each of the three classifiers performs, right?
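
For illustration, this combined variant could be sketched as follows, assuming scikit-learn and treating the classifier itself as one more entry in the grid via a `Pipeline` step (here named `clf`, an arbitrary choice). Data, grids and fold counts are again placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# The single pipeline step "clf" is swapped out by the grid search.
pipe = Pipeline([("clf", SVC())])
combined_grid = [
    {"clf": [SVC()], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [MLPClassifier(max_iter=300, random_state=0)],
     "clf__hidden_layer_sizes": [(10,), (10, 10)]},
    {"clf": [RandomForestClassifier(n_estimators=30, random_state=0)],
     "clf__max_depth": [2, 4, 8]},
]

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, combined_grid, cv=inner_cv)
result = cross_validate(search, X, y, cv=outer_cv, return_estimator=True)

# One score per outer fold for the combined "pick family + hyperparameters" procedure.
print("outer scores:", result["test_score"])

# Which classifier family won the inner search in each outer fold.
winners = [type(est.best_params_["clf"]).__name__ for est in result["estimator"]]
print("selected per outer fold:", winners)
```

The outer scores here describe the combined selection procedure as a whole, so there is no separate number per classifier family.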

I will try to answer my own question.

Let's start with the selection of the model:

  1. Should I use an SVM, Neural Network (NN) or Random Forest (RF)?

To answer this question, the following should be done:

  • For each of the three classifiers, define a grid of the hyperparameters that should be analyzed.
  • For each of the three classifiers, do a separate nested cross validation during which the hyperparameters are tuned.
  • The hyperparameters selected in the inner loop will vary from fold to fold. Therefore, the nested CV is not a way to select hyperparameters.
  • Instead, use only the average score of the outer loop to get an unbiased estimate for the model under consideration (where "model" here means SVM, NN or RF).
  • For each model (again: SVM, NN or RF), the nested CV yields one such value. This value is then an unbiased metric to tell which model works best on the task. Select this model.
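
The bullets above could be sketched like this with scikit-learn. The grids, the synthetic data and the small estimator settings are assumptions for the sake of a quick run. The same `outer_cv` object with a fixed `random_state` is reused so that all three models are scored on identical outer splits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# One (estimator, grid) pair per model family; grids are illustrative only.
models = {
    "SVM": (SVC(), {"C": [0.1, 1.0, 10.0]}),
    "NN": (MLPClassifier(max_iter=300, random_state=0),
           {"hidden_layer_sizes": [(10,), (10, 10)]}),
    "RF": (RandomForestClassifier(n_estimators=30, random_state=0),
           {"max_depth": [2, 4, 8]}),
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

nested_scores = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=inner_cv)       # inner loop: tuning
    scores = cross_val_score(search, X, y, cv=outer_cv)       # outer loop: evaluation
    nested_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Pick the family with the highest mean outer score.
best_family = max(nested_scores, key=nested_scores.get)
print("selected model family:", best_family)
```

Note that no hyperparameters are fixed at this point; only the model family is chosen.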

Now a model is selected, and I can tell whether I want to choose an SVM, NN or RF. Notice that no choice of hyperparameters has been made so far! Let's assume from now on that the Random Forest yielded the highest score, so that we want to use it as our model.

  2. Which hyperparameters should I use for my model (i.e. the Random Forest)?

To answer this, do the following:

  • Use the grid of hyperparameters for the RandomForest defined before.
  • Loop over the hyperparameters (or combinations of hyperparameters): for params in hyperparameters:
    • Do k-fold cross validation (single, not nested!) during which we train the RF with params
    • Measure the mean score, i.e. the average score over all k folds
  • The hyperparameters with the highest mean score are the ones which should be chosen for the final model
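
Written out, the loop above might look like the following sketch. It assumes scikit-learn, synthetic data and a two-parameter grid that I made up for illustration; `ParameterGrid` just enumerates all combinations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

hyperparameters = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]}  # illustrative
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = []
for params in ParameterGrid(hyperparameters):
    model = RandomForestClassifier(n_estimators=30, random_state=0, **params)
    # Single (non-nested) k-fold CV for this one combination.
    scores = cross_val_score(model, X, y, cv=cv)
    results.append((scores.mean(), params))

# The combination with the highest mean score over the k folds.
best_score, best_params = max(results, key=lambda item: item[0])
print("best hyperparameters:", best_params)
print("mean CV score:", best_score)
```

`GridSearchCV` performs essentially the same loop internally; the explicit version is shown only to mirror the steps listed above.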

Now the hyperparameters of the Random Forest are determined.

We have two score values for the Random Forest: (i) the one from the nested cross validation, where (almost certainly) different hyperparameters were selected by the inner loop in each outer fold, and (ii) the score of the single cross validation during which we found our best hyperparameters for the RF.

The first one only told us which model to choose whereas the second one tells us how well the model including fixed hyperparameters performs.

A final model used for prediction of new unseen data would then be the Random Forest with the selected hyperparameters trained on all data.
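
As a sketch of this last step, assuming scikit-learn: with `refit=True` (the default), `GridSearchCV` retrains the best configuration on all the data passed to `fit`. The split into "all data" and "new data" below is artificial and only there so the example has something to predict on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_all, y_all = X[:280], y[:280]   # everything available for training
X_new = X[280:]                   # pretend these samples arrive later

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(n_estimators=30, random_state=0),
                      {"max_depth": [2, 4, 8]},  # illustrative grid
                      cv=cv, refit=True)
search.fit(X_all, y_all)

# best_estimator_ has been refit on all of X_all with the selected hyperparameters.
final_model = search.best_estimator_
predictions = final_model.predict(X_new)

print("selected hyperparameters:", search.best_params_)
print("predictions for new data:", predictions)
```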

There is just one small thing left open: I would think that the second score value (the one from the single CV) is a proper measure of how well our final model (including hyperparameters) works on new data. Is this correct? And could one say that, in general, the nested CV score of this model is a) higher than, b) lower than or c) equal to the non-nested score? A comment on this would be highly appreciated.
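
One way to look at this empirically is to compute both numbers on the same data, as in the sketch below (scikit-learn, synthetic data and a placeholder grid are all assumptions). Since the non-nested figure is the maximum over the grid, evaluated on the same folds that were used to pick it, it may come out optimistic; the sketch only prints both values and does not settle the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(RandomForestClassifier(n_estimators=30, random_state=0),
                      {"max_depth": [2, 4, 8]},  # illustrative grid
                      cv=inner_cv)

# (ii) Non-nested: best mean score over the grid from a single CV.
search.fit(X, y)
non_nested_score = search.best_score_

# (i) Nested: the tuning is repeated inside every outer training split.
nested_score = cross_val_score(search, X, y, cv=outer_cv).mean()

print("non-nested score:", non_nested_score)
print("nested score:    ", nested_score)
print("difference:      ", non_nested_score - nested_score)
```

Repeating this over several random seeds for the two splitters would show how the difference behaves on average rather than for one particular split.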