Confidence Interval – Getting CIs and p-Values for Cross-Validated Performance Measures (AUC, Rho)

bootstrap, confidence-interval, cross-validation, p-value

I have a pretty small data set (approx. 150 observations) that I'm using to predict both a binary outcome variable and a continuous one. Right now, I'm using nested cross-validation as follows:

The outer loop is 20-times-repeated 5-fold CV, producing 100 performance metrics. Each inner loop is a 10-fold CV to tune hyperparameters.

I train a number of different algorithms (OLS, Lasso, Random Forest) for both problems, yielding 100 AUC values for the classifiers and 100 correlation coefficients for the regressions.

I've been trying to figure out the best way to produce confidence intervals for these metrics and to test whether the predictive accuracy is better than chance, and I'm somewhat lost, since the samples are not independent. One option that has been suggested here is to use a bootstrap (e.g. Confidence intervals for cross-validated statistics):

I'm not sure if I understood the procedure. Would this be the correct way to do it?

  • Instead of the outer CV loop, do a bootstrap around the inner CV: in each iteration, draw a sample of n = 150 with replacement and use 80% of it as the training set.
  • In each bootstrap iteration, run a 10-fold CV to grid-search the hyperparameters.
  • After finding the best parametrization with the inner CV, test it on the remaining 20% of the data, producing one performance metric.
  • Repeat 1,000 times to get enough measures.
  • The confidence interval is then simply the range between the 2.5th and 97.5th percentiles of the observed metrics (see the sketch below).
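
Here is the kind of sketch I have in mind, in R. It uses the lasso as an example learner, with cv.glmnet() standing in for the inner 10-fold grid search; the toy data and the little rank-based AUC helper are made up purely for illustration:

    ## rough sketch of the bootstrap procedure above (toy data, lasso only)
    library(glmnet)

    auc <- function(score, label) {                 # rank-based (Mann-Whitney) AUC
      r  <- rank(score)
      n1 <- sum(label == 1); n0 <- sum(label == 0)
      (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    set.seed(1)
    n <- 150
    X <- matrix(rnorm(n * 10), n, 10)               # toy predictors
    y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))      # toy binary outcome

    B    <- 1000                                    # bootstrap iterations
    aucs <- numeric(B)
    for (b in seq_len(B)) {
      idx   <- sample(n, replace = TRUE)            # draw n = 150 with replacement
      train <- idx[1:round(0.8 * n)]                # 80% for training / tuning
      test  <- idx[(round(0.8 * n) + 1):n]          # remaining 20% held out
      fit   <- cv.glmnet(X[train, ], y[train], family = "binomial", nfolds = 10)
      p     <- predict(fit, X[test, ], s = "lambda.min", type = "response")
      aucs[b] <- auc(as.vector(p), y[test])
    }
    quantile(aucs, c(0.025, 0.975), na.rm = TRUE)   # percentile CI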

The problem is that this method would take ages to run. I also found an R package that can give a CI for the AUC without my having to run 1,000 iterations: https://github.com/ledell/cvAUC. I could then get a CI and p-value for rho simply by running the outer 5-fold CV once (no repeats) and collecting all 150 out-of-fold predictions to produce a single rho value with its regular CI and p-value (a one-sided test of $\rho > 0$). Should I do this instead of the bootstrap, or how do the two alternatives compare?
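
If I understand that package correctly, the cvAUC route would look roughly like this: its ci.cvAUC() function takes the out-of-fold predictions, the labels, and the fold assignment and returns an influence-curve based confidence interval. The toy preds, labels and folds below are placeholders for one real pass of the outer 5-fold CV:

    library(cvAUC)

    set.seed(1)
    n      <- 150
    labels <- rbinom(n, 1, 0.5)                     # placeholder outcomes
    folds  <- sample(rep(1:5, length.out = n))      # outer 5-fold assignment
    preds  <- 0.6 * labels + 0.4 * runif(n)         # placeholder CV predictions

    ci.cvAUC(predictions = preds, labels = labels, folds = folds,
             confidence = 0.95)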

Best Answer

You can do corrected resampled t-tests (Nadeau & Bengio, 2003). They account for the lack of independence between the resamples in repeated cross-validation. They are of course less powerful than normal t-tests, but probably the best option available to you.

You also need the number of outer folds times the number of repetitions to be at least $30$, because the usual classification performance metrics cannot be assumed to be normally distributed and you have to rely on the central limit theorem for the mean. As you have $100$, you are on the safe side.

Use one-sample t-tests, since you have paired data: the algorithms have all been trained and tested on the same folds. First, compute all performance differences between two algorithms on the same fold $k\in \{1,\ldots , K\}$ in the same repetition $r\in \{1,\ldots , R\}$:

$$d_{kr}$$

The sample mean and variance are computed the usual way:

$$\hat{\mu}_d= \frac{1}{K\times R} \sum_{k=1}^K \sum_{r=1}^R d_{kr}$$ $$ \hat{\sigma}_d^2=\frac{1}{(K\times R) - 1} \sum_{k=1}^K \sum_{r=1}^R (d_{kr}-\hat{\mu}_d)^2 $$

The following adjusted test statistic should be compared against regular Student's t tables with $(K\times R) -1$ degrees of freedom:

$$T = \hat{\mu}_d\left/\sqrt{\left(\frac{1}{K\times R}+\frac{1/K}{1-1/K}\right)\hat{\sigma}_d^2}\right.$$

This replaces the usual test statistic, which does not correct for the non-independent samples:

$$T = \hat{\mu}_d\left/\sqrt{\frac{\hat{\sigma}_d^2}{K\times R}}\right.$$

If you want to split hairs, you can replace the estimate $\frac{1/K}{1-1/K}$ in the correction factor with the actual number of records in the current outer test fold divided by the actual number of records in all other outer folds. That is only necessary for small data sets, though.
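
As a minimal sketch of the corrected test in R, assuming d already holds the $K \times R$ paired differences from above:

    ## corrected resampled t-test (minimal sketch);
    ## d is assumed to be the vector of K*R paired fold differences
    corrected_t_test <- function(d, K, R) {
      mu   <- mean(d)
      s2   <- var(d)                        # uses the (K*R - 1) denominator
      corr <- (1 / K) / (1 - 1 / K)         # test/train size ratio = 1/(K - 1)
      se   <- sqrt((1 / (K * R) + corr) * s2)
      tval <- mu / se
      df   <- K * R - 1
      list(t = tval, df = df, p.value = 2 * pt(-abs(tval), df))
    }

    # example with K = 5 folds and R = 20 repetitions (fake differences):
    set.seed(42)
    d <- rnorm(100, mean = 0.02, sd = 0.05)
    corrected_t_test(d, K = 5, R = 20)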

In your case, $\frac{1}{K\times R}=0.01$ and the correction factor $\frac{1/K}{1-1/K}=\frac{1}{K-1}=0.25$. So your standard error has been inflated compared to a normal t-test by a factor of:

$$\frac{\sqrt{0.01 + 0.25}}{\sqrt{0.01}}\approx 5.10$$

Had you done 10-times-repeated 10-fold CV instead (which Weka does by default, for example), the standard error would have been inflated by only a factor of:

$$\frac{\sqrt{0.01 + 1/9}}{\sqrt{0.01}}\approx 3.48$$

For a given number of $K \times R$ sample points in the t-test, the correction is harsher the more repetitions, and thereby the fewer folds, you have. A 100-fold CV without repetition would inflate the standard error only by a factor of 1.42. But you need huge data sets if you want performance metrics computed on 1% of your records to behave like interval variables, so you cannot always do that.
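
These factors are quick to verify:

    ## inflation of the standard error relative to an uncorrected t-test
    inflation <- function(K, R) sqrt(1 / (K * R) + 1 / (K - 1)) / sqrt(1 / (K * R))
    inflation(K = 5,   R = 20)   # ~5.10  (20x repeated 5-fold)
    inflation(K = 10,  R = 10)   # ~3.48  (10x repeated 10-fold)
    inflation(K = 100, R = 1)    # ~1.42  (100-fold, no repetition)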

For the confidence intervals, keep using the same correction factor as before, i.e. essentially the same corrected standard error:

$$\hat{\mu}_d \pm t_{\alpha/2}^{(K\times R) -1} \times \sqrt{\left(\frac{1}{K\times R}+\frac{1/K}{1-1/K}\right)\hat{\sigma}_d^2}$$

I'm not sure if they are implemented in an R package, but honestly it takes you only a couple of lines to code this yourself.
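
For example, a minimal sketch, with d holding the paired fold differences as before:

    ## corrected confidence interval (minimal sketch);
    ## d again holds the K*R paired fold differences
    corrected_ci <- function(d, K, R, conf = 0.95) {
      mu    <- mean(d)
      se    <- sqrt((1 / (K * R) + (1 / K) / (1 - 1 / K)) * var(d))
      tcrit <- qt(1 - (1 - conf) / 2, df = K * R - 1)
      c(lower = mu - tcrit * se, upper = mu + tcrit * se)
    }

    corrected_ci(d, K = 5, R = 20)   # d as in the test-statistic sketch above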

Don't forget to correct for multiple comparisons (due to multiple algorithms being compared) afterwards.