Solved – Confidence Intervals for AUC using cross-validation

confidence interval, cross-validation, roc

I am analyzing the performance of a predictive model with the AUC (area under the ROC curve). I repeat cross-validation several times, so I get a different estimate of the AUC in each fold. For example, if I repeat 10-fold CV 10 times, I have 100 estimates of the AUC, from which I can calculate MEAN(AUC) and SD(AUC).
My question is: how could I use this to calculate a 95% confidence interval for the AUC?
These are some possible answers, but I am not sure whether they are correct:

(1) The 0.025 and 0.975 percentiles of the 100 sorted AUCs

(2) [ MEAN(AUC) – 1.96*SD(AUC) , MEAN(AUC) + 1.96*SD(AUC) ]

(3) [ MEAN(AUC) – 1.96*(SD(AUC)/sqrt(100)) , MEAN(AUC) + 1.96*(SD(AUC)/sqrt(100)) ]

Some comments:
– (3) is similar to (2) but takes into account the sample size determined by the number of repetitions I decide to do, so it becomes narrower as I increase the number of repetitions
– The intervals generated by (2) and (3) are symmetric
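
For concreteness, here is a minimal sketch of how I would compute each candidate interval, assuming the 100 AUC estimates are stored in a 1-D NumPy array called aucs (the name is just a placeholder for illustration):

import numpy as np

# aucs: hypothetical 1-D NumPy array holding the 100 AUC estimates (10 x 10-fold CV)
n = len(aucs)
mean_auc = aucs.mean()
sd_auc = aucs.std(ddof=1)  # sample standard deviation

# (1) percentile interval of the 100 sorted AUCs
ci_1 = np.percentile(aucs, [2.5, 97.5])

# (2) MEAN(AUC) +/- 1.96 * SD(AUC)
ci_2 = (mean_auc - 1.96 * sd_auc, mean_auc + 1.96 * sd_auc)

# (3) MEAN(AUC) +/- 1.96 * SD(AUC) / sqrt(100)
ci_3 = (mean_auc - 1.96 * sd_auc / np.sqrt(n), mean_auc + 1.96 * sd_auc / np.sqrt(n))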

What do you think?
Thanks

Best Answer

Here is an example of how you could do it in Python.

import math
from sklearn.model_selection import cross_val_score

# 10-fold CV, scoring each fold with the area under the ROC curve
scores = cross_val_score(your_model, your_data, y, cv=10, scoring='roc_auc')
mean_score = scores.mean()
std_dev = scores.std(ddof=1)                      # sample standard deviation of the 10 fold scores
std_error = std_dev / math.sqrt(scores.shape[0])  # standard error of the mean
ci = 2.262 * std_error                            # t quantile for 95% with 10 - 1 = 9 degrees of freedom
lower_bound = mean_score - ci
upper_bound = mean_score + ci

print("Score is %f +/- %f" % (mean_score, ci))
print("95 percent probability that if this experiment were repeated over and "
      "over, the average score would be between %f and %f" % (lower_bound, upper_bound))
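
If you want to mirror the 10 x 10 repeated CV from your question, one possible variation is sketched below. It assumes scikit-learn's RepeatedStratifiedKFold and the same placeholder names (your_model, your_data, y) as above, and it also prints the percentile-style interval from your option (1):

import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 10 repetitions of 10-fold CV -> 100 AUC estimates, as in the question
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
aucs = cross_val_score(your_model, your_data, y, cv=cv, scoring='roc_auc')

# percentile interval over the 100 estimates (option (1) in the question)
lower, upper = np.percentile(aucs, [2.5, 97.5])
print("Mean AUC %.3f, percentile interval [%.3f, %.3f]" % (aucs.mean(), lower, upper))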