Let me chime in from a different point of view:
"Cross validation" and "validation set" are concepts that are orthogonal/independent in the sense that:
1. Validation set is about asking: how many/which separate data subsets do I need?
2. Whereas cross validation is one possible answer to the question: how do I generate/split my data to produce these subsets?
The original purpose of validation sets (1) was, well, validation (or rather verification), i.e. measuring the generalization performance of the already trained model.
In that sense, yes, you do need a validation set. Note, though, that this validation set I'm talking about has a totally different purpose from @Jai's validation set (see below).
Cross validation (2) is one very widely applied scheme to split your data so as to generate pairs of training and validation sets. Alternatives range from other resampling techniques such as out-of-bootstrap validation, through single splits (hold-out), all the way to doing a separate performance study once the model is trained.
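To make these splitting schemes concrete, here is a minimal sketch (not from the original discussion; scikit-learn and the toy data are my own assumptions) of a single hold-out split versus k-fold generation of training/validation pairs:

```python
# Two ways of generating training/validation pairs from one data set.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # placeholder feature matrix
y = rng.integers(0, 2, size=100)     # placeholder binary labels

# Single hold-out split: one training/validation pair.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# k-fold cross validation: k training/validation pairs from the same data.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # ... fit a surrogate model on the training part, evaluate on the validation part
```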
At some point, the need arose to do some fine-tuning of hyperparameters. Unfortunately, instead of saying: fine, our new training algorithm internally performs an optimization of generalization error, and therefore we split the training set again into a hyperparameter-optimization set and a normal parameter-fit set, the former validation set was used for this optimization. Because that is really part of the training, another set was needed to estimate the final model's performance, i.e. a set that does what the validation set used to do. This needed another name, and became known as the test set.
In my experience this historic naming scheme train-validate-test creates a lot of confusion, particularly in fields where verification and validation were already established terminology for studying/demonstrating the predictive performance of methods.
Personally, I therefore prefer to speak
- either of training-optimization-verification, or
- of training and verification/validation, pointing out that inside your training you can do whatever further splits you like.
This point of view has the advantage that it is much easier to see which set of hyperparameters should be used when doing the final training with the whole data set.
Maybe this also helps to explain:
> why setting the hyperparameters to best fit the validation set is right, and doing that for the test set is wrong, if they both come from the same distribution? Both are the same way of cheating the way I see it.
The idea is that during training you are allowed (and supposed) to find out as much as possible about this distribution. Validation/verification then is to prove how much about this distribution was actually learned. And hyperparameter tuning really is part of the training.
Another analogy to the training-optimization-verification splitting is school: training is when a concept is explained to you. You then may do some practice exams to challenge and fine-tune your understanding of the concept. Finally, there is an exam to demonstrate the learned ability. Even if you do another round of fine-tuning your understanding after the exam, the mark is set. The same goes for a model, just that for many practically relevant situations we know there is a much higher danger of overfitting with our models, so we just don't accept any claim of improvement over the validation (exam) without proof (another validation, i.e. re-taking the exam).
Now for each of these splitting steps, you need to decide how to do this. Doing single splits leads to the fixed train-optimize-verify (aka train-validate-test) approach. Doing cross validation for both is called nested or double cross validation. Your intermittent cross validation corresponds to doing cross validation for the (train+optimize) vs. verify split, and a single split for train vs. optimize.
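As an illustration of the nested (double) cross validation variant, here is a minimal sketch assuming scikit-learn, with a placeholder SVM and a made-up parameter grid (neither comes from the question): the inner loop does the train vs. optimize split, the outer loop the (train+optimize) vs. verify split.

```python
# Nested (double) cross validation: hyperparameter optimization in the inner loop,
# verification of the whole training procedure in the outer loop.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))        # placeholder features
y = rng.integers(0, 2, size=120)     # placeholder binary labels

inner = KFold(n_splits=5, shuffle=True, random_state=1)   # train vs. optimize
outer = KFold(n_splits=5, shuffle=True, random_state=2)   # (train+optimize) vs. verify

# The "training procedure" = fitting plus internal hyperparameter optimization.
tuned_svm = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner)

# The outer cross validation verifies that whole procedure.
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer)
print(outer_scores.mean(), outer_scores.std())
```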
> Would it be reasonable to think that they changed the hyperparameters in each of the 10 iterations (where at the same time they were also changing the training and validation data, since that is what K-fold cross validation does), and then they went with the set of hyperparameters that gave the best test accuracy during that process?
No, this is not a good idea: picking the hyperparameters that give the best test accuracy turns the test set into yet another optimization set.
A valid approach would be to optimize the training within each fold, and record the test results. This basically corresponds to a cross validation of a training procedure that internally does a single split into train and optimization data sets.
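A minimal sketch of that valid approach, again assuming scikit-learn with a placeholder SVM and a made-up hyperparameter grid: the outer cross validation tests a training procedure that internally does a single split into train and optimization sets.

```python
# Outer cross validation of a training procedure with a single internal
# train/optimization split for hyperparameter tuning.
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))        # placeholder features
y = rng.integers(0, 2, size=120)     # placeholder binary labels

outer = KFold(n_splits=10, shuffle=True, random_state=0)
test_scores = []
for trainopt_idx, test_idx in outer.split(X):
    # Single internal split: train set vs. optimization set.
    X_tr, X_opt, y_tr, y_opt = train_test_split(
        X[trainopt_idx], y[trainopt_idx], test_size=0.25, random_state=0)

    # Pick the hyperparameter that does best on the optimization set.
    best_C, best_score = None, -np.inf
    for C in [0.1, 1, 10]:
        score = SVC(C=C).fit(X_tr, y_tr).score(X_opt, y_opt)
        if score > best_score:
            best_C, best_score = C, score

    # Retrain on the whole (train+optimize) part and record the test result.
    model = SVC(C=best_C).fit(X[trainopt_idx], y[trainopt_idx])
    test_scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(test_scores))
```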
Best Answer
The underlying difficulty is that cross validation results (actually: all test results) are subject to several sources of variance (read the Dietterich and the Bengio & Grandvalet papers).
The usual tests the linked blog post discusses all assume that the data can be described using one variance term.
Sources of variance:
1. Variance due to the finite (small) number of test cases: for figures of merit that are proportions of tested cases (e.g. accuracy) we can actually estimate this variance based on the number of independent test cases and the observed proportion via the binomial distribution (see the small numeric sketch after this list).
2. Variance due to model instability, i.e. variation between the surrogate models. This can be instability originating from the particular cases that happen to constitute the training data (for discussing k-fold cross validation we'll further divide this below).
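Here is the small numeric sketch referred to in item 1; the numbers are made up purely for illustration:

```python
# Binomial-based uncertainty of an observed accuracy (illustrative numbers only).
n_test = 200    # number of independent test cases
p_hat = 0.85    # observed accuracy (proportion of correctly classified test cases)

var_p = p_hat * (1 - p_hat) / n_test   # binomial variance of the observed proportion
se_p = var_p ** 0.5                    # corresponding standard error
print(f"accuracy {p_hat:.2f} +- {se_p:.3f} (1 standard error, n = {n_test})")
```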
Which (part) of these sources of variance is relevant depends on what question is actually asked (Dietterich makes a nice point of this), or in other words, in which ways we want to generalize the findings.
Here are some scenarios:
(a) How well does the particular model obtained from the data set at hand predict unknown cases from the same population?
(b) How well do models trained on a data set of this size drawn from the same population predict unknown cases, i.e. generalizing also over the training data?
For answering (a), if we directly test the model in question with an independent test set (a verification/validation study), only variance source 1 is relevant: any instability-type variance is part of the performance of the model we actually examine.
So in that scenario, we can use e.g. a paired test (in case both models in question are tested with the same test cases). Which paired test to choose (McNemar vs. t-test vs. other tests) depends on the figure of merit we compare. McNemar for binary outcomes, t-test/z-test for approximately normally distributed figures of merit and so on.
Fortunately, we can estimate this variance as soon as we have sufficient test cases in our testing.
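For the paired binary-outcome case, here is a minimal sketch of McNemar's test, assuming statsmodels; the contingency counts are made up:

```python
# McNemar's test for two models tested on the same cases (binary correct/wrong outcomes).
# The counts below are made up for illustration.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct / wrong; columns: model B correct / wrong.
table = np.array([[70, 12],
                  [ 5, 13]])

result = mcnemar(table, exact=True)   # exact binomial test on the discordant pairs
print(result.statistic, result.pvalue)
```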
Still question (a): If we don't have independent test data at hand and go for resampling such as cross validation, that will be subject to some bias (depending on the learning curve of the models and the choice of $k$). Plus, instability starts to play a role: the surrogate models we actually test may vary around the average of the learning curve.
However, for the cross validation approximation of the figures of merit (still for the models we actually get from the data set at hand), only the instability that arises from training on a $1 - \frac{1}{k}$ subset of the data set at hand is relevant for the uncertainty of the performance of the model obtained from our data set.
This can be estimated e.g. from repeated/iterated k-fold cross validation or out-of-bootstrap and the like.
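A minimal sketch of this estimation, assuming scikit-learn and placeholder data and classifier (not from the original discussion): each case is predicted once per repetition, so the spread of those predictions across repetitions reflects variation between the surrogate models, i.e. instability.

```python
# Repeated k-fold to separate instability-type variance from case-to-case variance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # placeholder features
y = rng.integers(0, 2, size=100)     # placeholder binary labels

n_splits, n_repeats = 5, 20
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)

preds = np.full((n_repeats, len(y)), np.nan)
for i, (train_idx, test_idx) in enumerate(rkf.split(X)):
    rep = i // n_splits              # repetition this fold belongs to
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    preds[rep, test_idx] = model.predict(X[test_idx])

# Per-case variance across repetitions: nonzero values indicate instability of the
# surrogate models; the finite-test-set variance is not part of this quantity.
print(preds.var(axis=0).mean())
```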
Now if we want to generalize both to unknown cases and models that are trained on another data set (of same/similar size) obtained from the same population (question b), we need to know how representative our data set is for the underlying training population. I.e. how much variance in the models we'd get if trained on $n$ new cases. That's what Bengio & Grandvalet are concerned with and what they show cannot be estimated from a single data set. This is also what the 5x2-fold scheme tries to get at - but at the price of a) having substantially smaller training sets for the surrogate models, and b) still having correlation since for each surrogate model, only 1 other surrogate model is independent, the other 8 are correlated as they share cases.
So if you can show that the surrogate models are stable (see below), then you could approximately say that all variance comes from the finite number of cases tested and go for the pairwise test just as you'd do for the independent test set.
How to show stability:
- via repeated/iterated k-fold: each case is tested exactly once per repetition/iteration. Any variance in the predictions of the same test case must originate from variation between the surrogate models, i.e. instability.
See e.g. our paper: Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations Anal Bioanal Chem, 2008, 390, 1261-1271.
DOI: 10.1007/s00216-007-1818-6
Other resampling schemes (out-of-bootstrap etc.) work as well: as long as you have several predictions of the same test case, you can separate that variance from the case-to-case variance.
- without repeated/iterated k-fold: if the fitted parameters of the surrogate models are equal (or sufficiently similar), we also know that the models are stable. This is a stronger condition than stability of the predictions, and it'll need some work to establish what order of magnitude of variation is sufficiently small.
Practically speaking, I'd say this may be doable for (bi)linear models where we can directly study the fitted coefficients (see the sketch below), but will probably not be feasible for other types of models. (And in any case it may need more time than getting some further iterations of the k-fold while you personally work on other stuff.)
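As a sketch of that coefficient-comparison idea for a linear model (scikit-learn and toy data assumed; what counts as "sufficiently similar" still has to be justified for the application at hand):

```python
# Compare the fitted coefficients of the surrogate models of a k-fold
# cross validation as a (strong) stability check for a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=100)

coefs = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    coefs.append(LinearRegression().fit(X[train_idx], y[train_idx]).coef_)
coefs = np.array(coefs)

print(coefs.mean(axis=0))   # average coefficients over the surrogate models
print(coefs.std(axis=0))    # small spread relative to the mean suggests stability
```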