Parameter tuning with vs without nested cross-validation

caret, cross-validation

Disclaimer: This question has been inspired by this one, which is a good question but has unfortunately not attracted an answer that actually answers the OP's question.

Statistical models often have parameters that need to be tuned in order to produce models with a good bias/variance tradeoff. This question relates to supposedly different flavors of parameter selection approaches that I have been exposed to and that I have difficulty reconciling in my mind.

Approach Nr. 1: Nested cross-validation

Nested cross-validation is the first approach to parameter tuning that I got to know. It consists of the following steps (assuming our algorithm of interest has only one parameter; this can easily be extended to more):

  • Split your data into k training/test folds (regular cross-validation approach)
  • For each training fold k:
    • Split the training fold further into m internal folds (this is where the 'nested' in 'nested cross-validation' comes from).
      • For each internal fold,
        • train one model on the internal training fold for each parameter in a parameter grid,
          evaluate on the internal test fold, and return the parameter with the lowest internal test error.
    • Take the optimal parameter returned, train a model using training fold k and evaluate your model on test fold k.

I find it very clear how this parameter selection approach produces models that do everything in their power to avoid overfitting while at the same time harnessing the maximum amount of data, since both the model evaluation and the parameter selection are fully cross-validated.
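
To make these steps concrete, here is a minimal sketch of how I picture approach 1. This is a toy example of my own (not taken from any tutorial): tuning the number of neighbours k for a k-nearest-neighbour classifier on iris, with caret's createFolds used only to generate the splits; the grid and fold counts are arbitrary choices.

    ## Nested CV sketch: tune k for kNN (toy example, arbitrary grid and fold counts)
    library(caret)   # for createFolds()
    library(class)   # for knn()

    set.seed(1)
    x <- iris[, 1:4]
    y <- iris$Species
    k_grid <- c(1, 3, 5, 7, 9)                    # parameter grid

    outer_folds <- createFolds(y, k = 5)          # outer test-fold indices
    outer_error <- numeric(length(outer_folds))

    for (i in seq_along(outer_folds)) {
      test_idx  <- outer_folds[[i]]
      train_idx <- setdiff(seq_along(y), test_idx)

      ## inner ('nested') loop: choose k using CV on the outer training fold only
      inner_folds <- createFolds(y[train_idx], k = 5)
      inner_error <- sapply(k_grid, function(k) {
        mean(sapply(inner_folds, function(in_test) {
          in_train <- setdiff(seq_along(train_idx), in_test)
          pred <- knn(x[train_idx[in_train], ], x[train_idx[in_test], ],
                      y[train_idx[in_train]], k = k)
          mean(pred != y[train_idx[in_test]])
        }))
      })
      best_k <- k_grid[which.min(inner_error)]

      ## train with the chosen k on the full outer training fold, evaluate on the outer test fold
      pred <- knn(x[train_idx, ], x[test_idx, ], y[train_idx], k = best_k)
      outer_error[i] <- mean(pred != y[test_idx])
    }

    mean(outer_error)   # fully cross-validated estimate of generalization error

The point, as I understand it, is that the parameter is chosen using only data from inside each outer training fold, so the outer test folds never influence the tuning.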

Approach Nr. 2: The caret approach

As the name implies, I was exposed to this approach when reading the documentation regarding parameter selection for the caret package. Parameter selection in the caret package apparently works as follows:

[Figure: schematic of caret's parameter-tuning algorithm, from the official caret documentation]

In this illustration (taken from the official caret documentation), the second loop ('resampling iteration') refers to one of many resampling strategies, one of them being 'simple' cross-validation (i.e. splitting all data into k training/test folds without any nested, inner cross-validation).

In less general pseudocode, this would be equivalent to:

  • For each parameter:
    • Split data into training/test folds (regular CV)
    • for each training fold:
      • train model on training fold
      • predict on test fold
    • return average performance (measured using your performance metric of choice)
  • For the parameter setting with the highest average performance, retrain a model on all training data (a concrete caret sketch follows below).
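
In actual caret code, my understanding is that this pseudocode corresponds to something like the following toy example (the model, grid and resampling scheme are arbitrary choices of mine):

    ## caret sketch: 10-fold CV over a kNN grid (toy example, arbitrary settings)
    library(caret)

    set.seed(1)
    ctrl <- trainControl(method = "cv", number = 10)   # 'simple' 10-fold CV resampling
    grid <- expand.grid(k = c(1, 3, 5, 7, 9))          # parameter grid for kNN

    fit <- train(Species ~ ., data = iris,
                 method    = "knn",
                 tuneGrid  = grid,
                 trControl = ctrl)

    fit$bestTune     # parameter with the best average resampled performance
    fit$finalModel   # model refit on all of the supplied data with that parameter

Here fit$finalModel appears to be the 'retrain a model on all training data' step from the last line of the pseudocode.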

Comparing approaches 1 and 2, there are three things that are unclear to me:

  1. In the last line of the pseudocode in approach 2, it says to retrain the model on 'all training data' using the optimal parameters. To me, this implies splitting all data into a training and a test (also called hold-out) set prior to applying everything explained in the pseudocode (this assumption is confirmed in the example following this pseudocode in the official caret documentation). This seems suboptimal to me, as you get to 'use' less data for training (probably resulting in an objectively worse model).

  2. If instead we retrain a model on all training data used for parameter selection, then how do we make sure we account for overfitting?

  3. On the other hand, none of the tutorials and Stack Overflow posts I've studied do this holdout split. Instead, they assume that the classifications they get from calling the train function in caret have been produced in a fashion that accounts for overfitting. At this point, I am uncertain how caret provides us with generalizable models when we use 'cv' or 'repeatedcv' in the trainControl function without a holdout split of the data to begin with.

Best Answer

Part of the issue for #1 is terminology. We usually think of the training and test sets as the initial splitting that is done when you have assembled and cleaned your data. Resampling only happens on the training set; the test set is left for a final, unbiased evaluation of the model once you have singled one out as being the best.

When resampling, I have been using different terminology for the data used to fit the model and for the data held out for immediate prediction. I call those the analysis and assessment sets, respectively. So for simple 10-fold CV, each analysis set is 90% of the training set and the assessment set is 10%.

Your point about inefficient use of data with a training and test set is one complaint that I've heard over the years. However, it is good scientific practice to have a confirmatory data set that is only used to reaffirm the results that you obtained during the modeling process. There are ways to do resampling incorrectly, and you would not know that this has occurred until you evaluate the next set of samples (ones that were not involved in the preceding analysis). Your point is valid but, unless your entire data set is pathologically small, the benefit of a test set far outweighs the inefficiency caused by the smaller training set.

For #2, the only way to really know when you are overfitting is with a separate data set (such as the assessment set). Whether that comes from nested resampling or non-nested (please don't call it the caret method), using the model to predict other samples is the only way to tell.

For #3, the process that I generally give to people is to do an initial training/test split, then resample the training set (using the same analysis/assessment splits across all testing). I generally use non-nested resampling (I'm the one who wrote caret) but nested sampling can be used too (more on that below). Executing the resampling process across different tuning parameters can be very effective at helping choose parameter values since overfitting is reflected in those statistics. Once you've settled on parameter values, the final model is refit on the entire training set.
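
A rough sketch of that sequence (the particular model, grid, and resampling settings below are only placeholders):

    ## initial training/test split, resample the training set, then a final check on the test set
    library(caret)

    set.seed(1)
    in_train <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
    training <- iris[ in_train, ]   # used for resampling and tuning
    testing  <- iris[-in_train, ]   # held back until the very end

    ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
    fit  <- train(Species ~ ., data = training,
                  method    = "knn",
                  tuneGrid  = expand.grid(k = c(1, 3, 5, 7, 9)),
                  trControl = ctrl)

    ## final, confirmatory evaluation on data that played no part in tuning
    confusionMatrix(predict(fit, testing), testing$Species)

The resampling results inside fit guide the parameter choice; the test set is only touched once, at the end.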

Think of the process like this: the model-related operations are a module, and this module can be applied to any data set. Resampling is a method of estimating the performance of that module and was invented to emulate what the results would be if the module were fit on the entire training set. Even though resampling sometimes uses less data when the module is repeatedly evaluated, it is still a good estimator of the final model that uses all the training data.

The documentation for the rsample package shows this at a more nuts-and-bolts level. For example, this page shows a neural network being tuned across epochs using simple 10-fold CV. In that example, you can see that the assessment sets (which would capture the effect of overfitting) are used to measure performance.
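
As a bare-bones illustration of that structure (a simplified stand-in for the tuning example on that page, with a placeholder data set):

    ## each resample has an analysis set (used to fit) and an assessment set (used to measure performance)
    library(rsample)

    set.seed(1)
    folds <- vfold_cv(mtcars, v = 10)

    one_split <- folds$splits[[1]]
    analysis(one_split)     # ~90% of the rows: fit the model here
    assessment(one_split)   # ~10% held out: predict here, where overfitting would show up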

About nesting versus non-nesting: the main worry in non-nested resampling is optimization bias. If we evaluate a large number of tuning parameter values, there is some bias that we get by just choosing the best value. We are likely to be optimistic in our performance estimate. That is a real pattern and it is shown nicely in the papers that discuss it.

However... my experience is that, although real, this bias is very small in most cases (especially when compared to the experimental noise). I have yet to see a real data set where non-nested resampling gave pathologically optimistic estimates. This vignette has a simulated case study using rsample that is a good demonstration. If the cost of nested resampling were not so high, I would definitely use it more often.