Cross-Validation – Differences Between Internal and External CV in Model Selection

Tags: cross-validation, estimation, references

My understanding is that with cross validation and model selection we try to address two things:

P1. Estimate the expected loss on the population when training with our sample

P2. Measure and report the uncertainty of this estimate (variance, confidence intervals, bias, etc.)

Standard practice seems to be to do repeated cross-validation, since this reduces the variance of our estimator.
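For concreteness, this is roughly what I mean by repeated cross-validation (a minimal scikit-learn sketch; the dataset and model are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Placeholder data and model; any estimator / dataset would do.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000)

# Repeated 10-fold CV: 10 repetitions x 10 folds = 100 fold scores.
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# P1: point estimate of the expected performance; P2: spread of the fold scores.
print(f"mean accuracy: {scores.mean():.3f}")
print(f"std of the fold scores: {scores.std():.3f}")
```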

However, when it comes to reporting and analysis, my understanding is that internal validation is better than external validation because:

It is better to report:

  • The statistics of our estimator, e.g. its confidence interval, variance, mean, etc. on the full sample (in this case the CV sample).

than reporting:

  • The loss of our estimator on a hold-out subset of the original sample, since:

    (i) This would be a single measurement (even if we pick our estimator with CV)

    (ii) Our estimator for this single measurement would have been trained on a set (e.g. the CV set) that is smaller than our initial sample, since we have to make room for the hold-out set. This results in a more biased (pessimistic) estimate of P1.

Is this correct? If not why?

Background:

It is easy to find textbooks that recommend dividing your sample into two sets:

  • The CV set, which is subsequently and repeatedly divided into train and validation sets.
  • The hold-out (test) set, only used at the end to report the estimator's performance.

My question is an attempt to understand the merits and advantages of this textbook approach, considering that our goal is really to address problems P1 and P2 stated at the beginning of this post. It looks to me that reporting on the hold-out test set is bad practice since the analysis of the CV sample is more informative.
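The textbook recipe I have in mind looks roughly like this (a hypothetical scikit-learn sketch; the 80/20 split, the grid and the model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, random_state=0)

# Hold-out (test) set, split off once and only touched at the very end.
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection by (inner) CV on the remaining CV set.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      cv=10)
search.fit(X_cv, y_cv)

# A single final performance measurement on the hold-out set.
print("hold-out accuracy:", search.score(X_test, y_test))
```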

Nested K-fold vs repeated K-fold:

One can in principle combine hold-out with regular K-fold to obtain nested K-fold. This would allow us to measure the variability of our estimator, but it looks to me that for the same number of total models trained (total # of folds) repeated K-fold would yield estimators that are less biased and more accurate than nested K-fold. To see this (a rough count of the fits is sketched after the list below):

  • Repeated K-fold uses a larger fraction of our total sample than nested K-fold for the same K (i.e. it leads to lower bias)
  • A budget of 100 trained models would only give 10 measurements of our estimator in nested K-fold (K=10), but 100 measurements in repeated K-fold (more measurements lead to lower variance in P2)
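
Rough bookkeeping for that claim (assuming K = 10 in both the outer and inner loop, and counting one fit per fold, i.e. ignoring the size of the hyper-parameter grid):

```python
K = 10        # folds (outer loop of nested CV, and plain K-fold)
REPEATS = 10  # repetitions of plain K-fold

# Repeated K-fold: every fit yields one performance measurement,
# and every fit trains on (K-1)/K of the full sample.
repeated_fits = REPEATS * K            # 100
repeated_measurements = repeated_fits  # 100

# Nested K-fold: the K*K inner fits only serve model selection;
# only the K outer fits yield performance measurements.
nested_fits = K * K + K  # inner fits plus the K outer refits = 110
nested_measurements = K  # 10

print(repeated_fits, repeated_measurements)  # 100 fits -> 100 measurements
print(nested_fits, nested_measurements)      # 110 fits -> 10 measurements
```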

What's wrong with this reasoning?

Best Answer

Let me add a few points to the nice answers that are already here:

Nested K-fold vs repeated K-fold: nested and repeated k-fold are totally different things, used for different purposes.

  • As you already know, nested is good if you want to use the inner cv for model selection.
  • Repeated: IMHO you should always repeat the k-fold cv [see below].

I therefore recommend repeating any nested k-fold cross-validation.
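In scikit-learn terms, a minimal sketch of what I mean (model, grid and data are placeholders): the inner CV is wrapped into the estimator, and the outer loop simply gets a repeated splitter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=1)

# Inner CV: hyper-parameter selection, wrapped into the estimator itself.
inner_model = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)

# Outer CV: repeated k-fold, i.e. repeated nested cross-validation.
outer_cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(inner_model, X, y, cv=outer_cv)

print(f"nested CV estimate: {scores.mean():.3f} +/- {scores.std():.3f}")
```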

Better report "The statistics of our estimator, e.g. its confidence interval, variance, mean, etc. on the full sample (in this case the CV sample).":

Sure. However, you need to be aware of the fact that you will not (easily) be able to estimate the confidence interval from the cross-validation results alone. The reason is that, however much you resample, the actual number of cases you look at is finite (and usually rather small; otherwise you wouldn't bother about these distinctions).
See e.g. Bengio, Y. and Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105.

However, in some situations you can nevertheless estimate parts of the variance: with repeated k-fold cross-validation, you can get an idea of whether model instability plays a role. And this instability-related variance is exactly the part of the variance that you can reduce by repeated cross-validation. (If your models are perfectly stable, each repetition/iteration of the cross-validation will yield exactly the same prediction for each case; yet you still have variance due to the actual choice/composition of your data set.) So there is a limit to how far repeated k-fold cross-validation can lower the variance: doing more and more repetitions/iterations does not help, because the variance caused by the fact that in the end only $n$ real cases were tested is not affected.
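A minimal sketch of such an instability check (hypothetical scikit-learn code with a placeholder model and data): collect each case's out-of-fold prediction in every repetition and look at how often they disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=200, random_state=2)
model = LogisticRegression(max_iter=1000)

# One column of out-of-fold predictions per repetition of the 10-fold CV.
preds = np.column_stack([
    cross_val_predict(model, X, y,
                      cv=KFold(n_splits=10, shuffle=True, random_state=rep))
    for rep in range(20)
])

# Perfectly stable models give each case the same prediction in every repetition;
# the fraction of cases with disagreeing predictions indicates instability.
unstable = np.mean([len(np.unique(row)) > 1 for row in preds])
print(f"fraction of cases with unstable predictions: {unstable:.2f}")
```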

The variance caused by the fact that in the end only $n$ real cases were tested can be estimated for some special cases, e.g. the performance of classifiers as measured by proportions such as hit rate, error rate, sensitivity, specificity, predictive values and so on: they follow binomial distributions. Unfortunately, this means that they have huge variance $\sigma^2 (\hat p) = \frac{1}{n} p (1 - p)$, with $p$ the true performance value of the model, $\hat p$ the observed one, and $n$ the sample size in the denominator of the fraction. The variance is maximal at $p = 0.5$. You can also calculate confidence intervals starting from the observation. (@Frank Harrell will comment that these are not proper scoring rules, so you shouldn't use them anyway - which is related to the huge variance.) However, IMHO they are useful for deriving conservative bounds (there are better scoring rules, and the bad behaviour of these fractions is a worst-case limit for the better rules),
see e.g. C. Beleites, R. Salzer and V. Sergo: Validation of Soft Classification Models using Partial Class Memberships: An Extended Concept of Sensitivity & Co. applied to Grading of Astrocytoma Tissues, Chemom. Intell. Lab. Syst., 122 (2013), 12 - 22.
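
To give a rough feel for those numbers, a small sketch (the 80-correct-out-of-100 test result is made up; I assume scipy is available for the exact binomial interval):

```python
import numpy as np
from scipy import stats

n_test, n_correct = 100, 80   # made-up result: 80 of 100 test cases correct
p_hat = n_correct / n_test

# Binomial variance of the observed proportion, sigma^2(p_hat) = p(1-p)/n
# (using p_hat in place of the unknown true p; worst case at p = 0.5).
var = p_hat * (1 - p_hat) / n_test
print(f"p_hat = {p_hat:.2f}, standard deviation = {np.sqrt(var):.3f}")

# Conservative (Clopper-Pearson) confidence interval starting from the observation.
ci = stats.binomtest(n_correct, n_test).proportion_ci(confidence_level=0.95)
print(f"95 % CI for the hit rate: {ci.low:.3f} to {ci.high:.3f}")
```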

So this lets me turn your argument against the hold-out around:

  • Neither does resampling alone (necessarily) give you a good estimate of the variance,
  • OTOH, if you can reason about the finite-test-sample-size variance of the cross-validation estimate, you can do the same for the hold-out.

Our estimator for this single measurement would have been trained on a set (e.g. the CV set) that is smaller than our initial sample, since we have to make room for the hold-out set. This results in a more biased (pessimistic) estimate of P1.

Not necessarily (compared to k-fold), but you have to trade off: a small hold-out set (e.g. $\frac{1}{k}$ of the sample) gives low bias (≈ the same as k-fold CV) but high variance (> k-fold CV, roughly by a factor of $k$).
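
Putting numbers to that trade-off with the binomial variance from above (sample size, true hit rate and $k$ are made up; this captures only the finite-test-size part of the variance):

```python
import numpy as np

n, p, k = 1000, 0.8, 10   # made-up: total sample size, true hit rate, number of folds

# k-fold CV: all n cases are tested (once each).
var_cv = p * (1 - p) / n

# Hold-out of size n/k: only n/k cases are ever tested.
var_holdout = p * (1 - p) / (n / k)

print(f"sd, k-fold CV (n cases tested):   {np.sqrt(var_cv):.4f}")
print(f"sd, hold-out (n/k cases tested):  {np.sqrt(var_holdout):.4f}")
print(f"variance ratio hold-out / CV:     {var_holdout / var_cv:.1f}")  # = k
```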

It looks to me that reporting on the hold-out test set is bad practice since the analysis of the CV sample is more informative.

Usually, yes. However, it is also good to keep in mind that there are important types of errors (such as drift) that cannot be measured/detected by resampling validation.
See e.g. Esbensen, K. H. and Geladi, P. Principles of Proper Validation: use and abuse of re-sampling for validation, Journal of Chemometrics, 2010, 24, 168-187

but it looks to me that for the same number of total models trained (total # of folds) repeated K-fold would yield estimators that are less biased and more accurate than nested K-fold. To see this:

Repeated K-fold uses a larger fraction of our total sample than nested K-fold for the same K (i.e. it leads to lower bias)

I'd say no to this: it doesn't matter how the model training uses its $\frac{k - 1}{k} n$ training samples, as long as the surrogate models and the "real" model use them in the same way. (I look at the inner cross-validation / estimation of hyper-parameters as part of the model set-up).
Things look different if you compare surrogate models that are trained including the hyper-parameter optimization with "the" model that is trained on fixed hyper-parameters. But IMHO that is generalizing from $k$ apples to 1 orange.

A budget of 100 trained models would only give 10 measurements of our estimator in nested K-fold (K=10), but 100 measurements in repeated K-fold (more measurements lead to lower variance in P2)

Whether this makes a difference depends on the instability of the (surrogate) models, see above. For stable models it is irrelevant, and the same goes for whether you do 100 or 1000 outer repetitions/iterations.


And this paper definitely belongs on the reading list for this topic: Cawley, G. C. and Talbot, N. L. C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, Journal of Machine Learning Research, 2010, 11, 2079-2107.