Solved – An intuitive understanding of each fold of a nested cross validation for parameter/model tuning

Tags: classification, cross-validation, model selection, optimization

There are several questions on this site essentially asking how nested cross validation for parameter tuning works. A lot of the answers use some jargon that I find difficult to understand, but as far as I can tell, the intuitive understanding of nested cross validation I have developed is as follows:

In standard cross validation, if you have 8 "runs", you train a classifier on 7 runs and test on the remaining one. You do this 8 times, such that each run is the "testing set" exactly once (hence, 8-fold cross validation).

In nested cross validation, the training set is itself subjected to a cross validation with a 6-run training set and a 1-run validation set (and therefore has its own 7 "inner folds"). You can try many different parameter combinations, kernel functions, feature selection methods, etc. by training each candidate on the 6-run training set and testing it on the 1-run validation set. You repeat this so that every inner run takes a turn as the validation set, just as in normal cross validation. After all candidate models have been evaluated via this "inner" cross validation, you select the best-performing one across all inner folds, train it on the entire 7-run training set, and test it on the original testing set, which has not been touched by the inner loop.

That describes one fold of the "outer loop" (and the seven folds of the "inner loop" inside it). Now you do it all again, with each outer fold taking its turn as the "testing" set.
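
To make the double loop concrete, here is a minimal sketch in scikit-learn. The synthetic data, the SVM classifier, and the parameter grid are purely illustrative choices; only the 8 outer runs and 7 inner folds come from the description above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=400, random_state=0)  # illustrative data

    outer_cv = KFold(n_splits=8, shuffle=True, random_state=0)  # 8 outer "runs"
    inner_cv = KFold(n_splits=7, shuffle=True, random_state=0)  # 7 inner folds per outer fold
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}   # candidate models

    outer_scores, chosen_params = [], []
    for train_idx, test_idx in outer_cv.split(X):
        # Inner loop: train each candidate on the 6-run training sets,
        # validate on the rotating 1-run validation set, pick the winner.
        tuner = GridSearchCV(SVC(), param_grid, cv=inner_cv)
        tuner.fit(X[train_idx], y[train_idx])

        # GridSearchCV refits the winner on the entire 7-run training set;
        # test it on the outer fold the inner loop never saw.
        outer_scores.append(tuner.best_estimator_.score(X[test_idx], y[test_idx]))
        chosen_params.append(tuner.best_params_)

    print("nested CV estimate: %.3f +/- %.3f" % (np.mean(outer_scores), np.std(outer_scores)))
    print("hyperparameters per outer fold:", chosen_params)

The chosen_params printout is exactly the situation question 2 below asks about: the outer folds may well disagree on the winning hyperparameters.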

The above is what I understand nested cross validation to be. I have one general question and one specific question.

  1. In a broad sense, have I misunderstood anything obvious?
  2. What I just described could potentially yield a different model for each "outer fold" of the data (i.e., different parameter selections, feature selections, etc.). Is that valid/correct? It doesn't /feel/ valid to me. If it is not valid, am I instead supposed to do all the inner loops first without doing a single outer loop, find the overall best model, and then run all 8 outer folds with that same model?

Best Answer

  1. Your understanding sounds good to me, with the possible exception that what you call a "run" is, in my field, called either a "fold" (as in 5-fold cross validation) when the held-out test data is meant, or a "surrogate model" when we're talking about the model trained on the remaining folds.

  2. Yes, the outer folds can return different hyperparameter sets and/or parameters (coefficients).
    This is valid in the sense that it is allowed to happen. It is invalid in the sense that it means the optimization (done with the help of the inner folds) is not stable, so you have not actually found "the" [global] optimum.

For the overall model, you're supposed to run the inner cross validation again on the whole data set. I.e., you optimize/auto-tune your hyperparameters on the training set (which is now the whole data set) just the same as you did inside each fold of the outer cross validation.
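
As a minimal sketch (reusing the illustrative names X, y, param_grid and inner_cv from the code after the question), the final model is obtained by running that same inner tuning procedure once more, now on all the data:

    # Final model: the same auto-tuning as inside each outer fold, on the whole data set.
    final_tuner = GridSearchCV(SVC(), param_grid, cv=inner_cv)
    final_tuner.fit(X, y)
    final_model = final_tuner.best_estimator_
    # Report the nested-CV estimate (outer_scores above) as this model's performance,
    # not final_tuner.best_score_, which is an inner-CV score and optimistically biased.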


Update: longer explanation

See also Nested cross validation for model selection and How to build the final model and tune probability threshold after nested cross-validation?

Are you saying that to get "the [global] optimum" you need to run your entire dataset on all the combinations of C's, gammas, kernels, etc.?

No. In my experience the problem is not that the search space is not explored in detail (all possible combinations) but rather that our measurement of the resulting model performance is subject to uncertainty.

Many strategies borrowed from numerical optimization implicitly assume that there is negligible noise on the target functional, i.e. that the functional is basically a smooth, continuous function of the hyperparameters. Depending on the figure of merit you optimize and the number of cases you have, this assumption may or may not be met.

If you do have considerable noise on the estimate of the figure of merit but do not take this into account (i.e. the "select the best one" strategy you mention), your observed "optimum" is subject to that noise.
In addition, the noise (variance uncertainty) on the performance estimate increases with model complexity. In this situation, naively "selecting the best observed performance" can also lead to a bias towards overly complex models.

See e.g. Cawley, G. C. & Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, Journal of Machine Learning Research, 11, 2079-2107 (2010).
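
A toy simulation (my own illustration, not taken from the paper) shows the effect: even when 50 candidate hyperparameter sets all have exactly the same true accuracy, "select the best observed performance" reports an apparent winner that looks better purely because of noise.

    import numpy as np

    rng = np.random.default_rng(0)
    true_acc, n_candidates, noise_sd = 0.80, 50, 0.03  # all candidates are truly equal

    # Repeat the "pick the best observed score" selection 1000 times.
    best_observed = [
        np.max(true_acc + rng.normal(0.0, noise_sd, n_candidates))
        for _ in range(1000)
    ]
    print("mean 'best observed' accuracy: %.3f" % np.mean(best_observed))  # roughly 0.87 vs. true 0.80

The more candidates you compare and the noisier each estimate (i.e. the fewer cases per validation fold), the larger this optimism gets.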

How does this get incorporated into the nested cross validation procedure or the final results of the analysis?

Hastie, T., Tibshirani, R. and Friedman, J.: The Elements of Statistical Learning; Data Mining, Inference, and Prediction. Springer Verlag, New York, 2009, say in chapter 7.10:

Often a “one-standard error” rule is used with cross-validation, in which we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.

I find this a good heuristic (I take the additional precaution of estimating the variance uncertainty due both to the limited number of cases and to model instability - The Elements of Statistical Learning does not discuss this in its cross validation chapter).
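
A minimal sketch of that heuristic, under two assumptions of mine (you have per-fold CV errors for each candidate, and the candidates are ordered from simplest to most complex):

    import numpy as np

    def one_se_rule(fold_errors_per_model):
        """Pick the simplest model whose mean CV error is within one standard error
        of the best model's error. `fold_errors_per_model` is a list, ordered from
        simplest to most complex, of arrays of per-fold errors."""
        means = np.array([np.mean(e) for e in fold_errors_per_model])
        ses = np.array([np.std(e, ddof=1) / np.sqrt(len(e)) for e in fold_errors_per_model])
        best = np.argmin(means)
        threshold = means[best] + ses[best]
        return int(np.argmax(means <= threshold))  # index of the first (= simplest) model under the threshold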


So your understanding:

I'm confused because my understanding is that you can't just run your analysis hundreds/thousands of times with different parameters/kernels and select the best one

is correct.

However, your understanding

(and nested CV is supposed to mitigate the associated issues).

may or may not be correct:

  • nested CV does not make the hyperparameter optimization any more successful,
  • but it can provide an honest estimate of the performance that can be achieved with that particular optimization strategy.

  • In other words: it guards against overoptimism about the achieved performance, but it does not improve this performance.


The final model:

  • The outer split of the nested CV is basically an ordinary CV for validation/verification. It splits the available data set into training and testing subsets, and then builds a so-called surrogate model using the training set.
  • During this training, you happen to do another (the inner) CV, whose performance estimates you use to fix/optimize the hyperparameters. But seen from the outer CV, this is just part of the model training.
  • The model training on the whole data set should do just the same as the model training inside the cross validation did. Otherwise the surrogate models and their performance estimates would not be good surrogates for the model trained on the whole data (and being a good surrogate is really the purpose of the surrogate models).

  • Thus: run the auto-tuning of hyperparameters on the whole data set just as you do during cross validation. Same hyperparameter combinations to consider, same strategy for selecting the optimum. In short: same training algorithm, just slightly different data (1/k additional cases).
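
Seen this way, a compact version of the whole procedure (same illustrative names as in the earlier sketches) makes the point explicit: wrapping the tuner in an estimator turns the inner CV into part of "model training", the outer loop into an ordinary CV, and the final model into one more run of that same training on all the data.

    from sklearn.model_selection import cross_val_score

    # "Training" now means: fit plus auto-tuning of hyperparameters via the inner CV.
    tuned_svm = GridSearchCV(SVC(), param_grid, cv=inner_cv)

    # Outer CV over the surrogate models: the honest performance estimate.
    honest_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)

    # Final model: exactly the same training, just on the whole data set.
    final_model = tuned_svm.fit(X, y)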
