Confidence Interval – Getting CIs and p-Values for Cross-Validated Performance Measures (AUC, Rho)

bootstrap, confidence-interval, cross-validation, p-value

I have a pretty small data set (approx. 150 observations) that I'm using to predict both a binary outcome variable and a continuous one. Right now, I'm using nested cross-validation as follows:

The outer loop is 20-times-repeated 5-fold CV, producing 100 performance metrics. Each inner loop is a 10-fold CV to tune hyperparameters.

I train a number of different algorithms (OLS, Lasso, Random Forest) for both problems, yielding 100 AUC values for the classifiers and 100 correlation coefficients for the regressions.

I've been trying to figure out the best way to produce confidence intervals for these metrics and to test whether the predictive accuracy is better than chance, and I'm somewhat lost, since the samples are not independent. One option that has been suggested here is to use a bootstrap (e.g. Confidence intervals for cross-validated statistics):

I'm not sure if I understood the procedure. Would this be the correct way to do it?

  • Instead of the outer CV loop, do a bootstrap around the inner CV: in each iteration, draw a sample of n = 150 with replacement and use 80% of it as the training set.
  • In each bootstrap iteration, run a 10-fold CV to grid-search the hyperparameters.
  • After finding the best parametrization with the inner CV, test it on the remaining 20% of the data, producing one performance metric.
  • Repeat 1,000 times to get enough measures.
  • The confidence interval is then simply the range between the 2.5th and 97.5th percentiles of the observed metrics (see the sketch below).
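
Here is the kind of sketch I have in mind, in R. It uses the lasso as an example learner, with cv.glmnet() standing in for the inner 10-fold grid search; the toy data and the little rank-based AUC helper are made up purely for illustration:

    ## rough sketch of the bootstrap procedure above (toy data, lasso only)
    library(glmnet)

    auc <- function(score, label) {                 # rank-based (Mann-Whitney) AUC
      r  <- rank(score)
      n1 <- sum(label == 1); n0 <- sum(label == 0)
      (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    set.seed(1)
    n <- 150
    X <- matrix(rnorm(n * 10), n, 10)               # toy predictors
    y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))      # toy binary outcome

    B    <- 1000                                    # bootstrap iterations
    aucs <- numeric(B)
    for (b in seq_len(B)) {
      idx   <- sample(n, replace = TRUE)            # draw n = 150 with replacement
      train <- idx[1:round(0.8 * n)]                # 80% for training / tuning
      test  <- idx[(round(0.8 * n) + 1):n]          # remaining 20% held out
      fit   <- cv.glmnet(X[train, ], y[train], family = "binomial", nfolds = 10)
      p     <- predict(fit, X[test, ], s = "lambda.min", type = "response")
      aucs[b] <- auc(as.vector(p), y[test])
    }
    quantile(aucs, c(0.025, 0.975), na.rm = TRUE)   # percentile CI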

The problem is that this method would take ages to run. I also found an R package that can give a CI for the AUC without my having to run 1,000 iterations: https://github.com/ledell/cvAUC. I could then get a CI and p-value for rho simply by running the outer 5-fold CV once (no repeats) and collecting all 150 out-of-fold predictions to produce a single rho value with its regular CI and p-value (a one-sided test of $\rho > 0$). Should I do this instead of the bootstrap, or how do the two alternatives compare?
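
If I understand that package correctly, the cvAUC route would look roughly like this: its ci.cvAUC() function takes the out-of-fold predictions, the labels, and the fold assignment and returns an influence-curve based confidence interval. The toy preds, labels and folds below are placeholders for one real pass of the outer 5-fold CV:

    library(cvAUC)

    set.seed(1)
    n      <- 150
    labels <- rbinom(n, 1, 0.5)                     # placeholder outcomes
    folds  <- sample(rep(1:5, length.out = n))      # outer 5-fold assignment
    preds  <- 0.6 * labels + 0.4 * runif(n)         # placeholder CV predictions

    ci.cvAUC(predictions = preds, labels = labels, folds = folds,
             confidence = 0.95)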

Best Answer

You can do corrected resampled t-tests (Nadeau & Bengio, 2003). They account for the lack of independence between the resamples in repeated cross-validation. They are of course less powerful than normal t-tests, but probably the best option available to you.

You also need the number of outer folds times the number of repetitions to be at least $30$, because the usual classification performance metrics cannot be assumed to be normally distributed and you have to rely on the central limit theorem for the mean. As you have $100$, you are on the safe side.

Use one-sample t-tests, since you have paired data: the algorithms have all been trained and tested on the same folds. First, compute all performance differences between two algorithms on the same fold $k\in \{1,\ldots , K\}$ in the same repetition $r\in \{1,\ldots , R\}$:

$$d_{kr}$$

The sample mean and variance are computed the usual way:

$$\hat{\mu}_d= \frac{1}{K\times R} \sum_{k=1}^K \sum_{r=1}^R d_{kr}$$ $$ \hat{\sigma}_d^2=\frac{1}{(K\times R) - 1} \sum_{k=1}^K \sum_{r=1}^R (d_{kr}-\hat{\mu}_d)^2 $$

The following adjusted test statistic should be compared against regular Student's t tables with $(K\times R) -1$ degrees of freedom:

$$T = \hat{\mu}_d\left/\sqrt{\left(\frac{1}{K\times R}+\frac{1/K}{1-1/K}\right)\hat{\sigma}_d^2}\right.$$

This replaces the usual test statistic, which does not correct for the non-independent samples:

$$T = \hat{\mu}_d\left/\sqrt{\frac{\hat{\sigma}_d^2}{K\times R}}\right.$$

If you want to split hairs, you can replace the estimate $\frac{1/K}{1-1/K}$ in the correction factor with the actual number of records in the current outer test fold divided by the actual number of records in all other outer folds. That is only necessary for small data sets, though.
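
As a minimal sketch of the corrected test in R, assuming d already holds the $K \times R$ paired differences from above:

    ## corrected resampled t-test (minimal sketch);
    ## d is assumed to be the vector of K*R paired fold differences
    corrected_t_test <- function(d, K, R) {
      mu   <- mean(d)
      s2   <- var(d)                        # uses the (K*R - 1) denominator
      corr <- (1 / K) / (1 - 1 / K)         # test/train size ratio = 1/(K - 1)
      se   <- sqrt((1 / (K * R) + corr) * s2)
      tval <- mu / se
      df   <- K * R - 1
      list(t = tval, df = df, p.value = 2 * pt(-abs(tval), df))
    }

    # example with K = 5 folds and R = 20 repetitions (fake differences):
    set.seed(42)
    d <- rnorm(100, mean = 0.02, sd = 0.05)
    corrected_t_test(d, K = 5, R = 20)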

In your case, $\frac{1}{K\times R}=0.01$ and the correction factor $\frac{1/K}{1-1/K}=\frac{1}{K-1}=0.25$. So your standard error has been inflated compared to a normal t-test by a factor of:

$$\frac{\sqrt{0.01 + 0.25}}{\sqrt{0.01}}\approx 5.10$$

Had you done 10-times-repeated 10-fold CV instead (which Weka does by default, for example), the standard error would have been inflated by only a factor of:

$$\frac{\sqrt{0.01 + 1/9}}{\sqrt{0.01}}\approx 3.48$$

For a given number of $K \times R$ sample points in the t-test, the correction is harsher the more repetitions, and thereby the fewer folds, you have. A 100-fold CV without repetition would inflate the standard error only by a factor of 1.42. But you need huge data sets if you want performance metrics computed on 1% of your records to behave like interval variables, so you cannot always do that.
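
These factors are quick to verify:

    ## inflation of the standard error relative to an uncorrected t-test
    inflation <- function(K, R) sqrt(1 / (K * R) + 1 / (K - 1)) / sqrt(1 / (K * R))
    inflation(K = 5,   R = 20)   # ~5.10  (20x repeated 5-fold)
    inflation(K = 10,  R = 10)   # ~3.48  (10x repeated 10-fold)
    inflation(K = 100, R = 1)    # ~1.42  (100-fold, no repetition)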

For the confidence intervals, keep using the same correction factor as before, i.e. essentially the same corrected standard error:

$$\hat{\mu}_d \pm t_{\alpha/2}^{(K\times R) -1} \times \sqrt{\left(\frac{1}{K\times R}+\frac{1/K}{1-1/K}\right)\hat{\sigma}_d^2}$$

I'm not sure if they are implemented in an R package, but honestly it takes you only a couple of lines to code this yourself.
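
For example, a minimal sketch, with d holding the paired fold differences as before:

    ## corrected confidence interval (minimal sketch);
    ## d again holds the K*R paired fold differences
    corrected_ci <- function(d, K, R, conf = 0.95) {
      mu    <- mean(d)
      se    <- sqrt((1 / (K * R) + (1 / K) / (1 - 1 / K)) * var(d))
      tcrit <- qt(1 - (1 - conf) / 2, df = K * R - 1)
      c(lower = mu - tcrit * se, upper = mu + tcrit * se)
    }

    corrected_ci(d, K = 5, R = 20)   # d as in the test-statistic sketch above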

Don't forget to correct for multiple comparisons (due to multiple algorithms being compared) afterwards.