Cross Validation – Differences Between Cross Validation and Bootstrapping to Estimate Standard Error of AUC

auc, bootstrap, cross-validation, predictive-models, roc

I know there has been some discussion of the differences between CV and bootstrapping for estimating the out-of-sample prediction error of a classifier.

For example, here (Differences between cross validation and bootstrapping to estimate the prediction error), here (Bootstrapping estimates of out-of-sample error) and here (What is the .632+ rule in bootstrapping?).

However, I'm interested in maximizing the AUC directly, not the prediction error (1 – accuracy) itself, since the cutoff points are not specified a priori.

Would the reasoning of the posts above still apply? I find it difficult, for example, to calculate an AUC from only 10 held-out observations (assuming 10-fold CV applied to a sample of 100 observations).

Currently I'm using the "optimism" bootstrap estimator, though it is pretty expensive (at least for the PC I have access to).
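For concreteness, the loop I'm running looks roughly like the sketch below (logistic regression and scikit-learn are just stand-ins for my actual model and fitting code; the resampling logic is the usual optimism-correction scheme, with X and y as NumPy arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=0):
    """Optimism bootstrap for the AUC: apparent AUC minus the average
    optimism estimated from models refit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    apparent = roc_auc_score(y, LogisticRegression().fit(X, y).predict_proba(X)[:, 1])
    optimisms = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample cases with replacement
        if len(np.unique(y[idx])) < 2:          # skip one-class resamples
            continue
        m = LogisticRegression().fit(X[idx], y[idx])
        auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])  # AUC on the resample
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])            # AUC on the original data
        optimisms.append(auc_boot - auc_orig)
    return apparent - np.mean(optimisms)
```

The cost is dominated by refitting the model n_boot times, which is what I'd like to avoid if CV gives a comparable answer.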

Any thoughts?

Best Answer

The AUC is equivalent to the c-index or concordance, the fraction of pairs of cases in which the ordering of the predictor value (based on the combination of predictor variables) is consistent with differences in outcome. So in principle if you wanted to do cross-validation you could use the paired comparisons of the held-out cases to calculate a concordance index (and thus an AUC) for each fold of CV.
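As a hedged sketch of that idea (Python here, with scikit-learn's logistic regression, a synthetic dataset, and stratified 10-fold splitting standing in for whatever model and data you actually have), the per-fold concordance calculation might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def concordance(y_true, scores):
    """Fraction of (case, non-case) pairs in which the case has the higher
    score; ties count as half. This equals the AUC."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all case/non-case comparisons
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Hypothetical data: 100 observations, 3 predictors, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(concordance(y[test_idx], scores))   # roughly 25 pairs per fold of 10 cases

print(np.round(fold_aucs, 2), np.mean(fold_aucs))
```

With only about 10 held-out cases per fold there are very few case/non-case pairs behind each fold's estimate, which is why the per-fold values are so noisy.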

But to get enough comparisons to be useful you would probably have to do not just one CV but multiple repeated cross-validations with different random splits of the cases. The many repetitions may undercut your assumption that CV requires fewer computing resources than bootstrapping. I find the bootstrap more straightforward and have used it for estimating standard errors of AUC values. If you only have on the order of 100 observations the computational cost shouldn't be large, so check the bootstrap algorithm you are using.
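A minimal sketch of that bootstrap, assuming you already have a vector of predicted scores for the cases (it resamples the cases and recomputes the AUC; refitting the model inside each resample would also capture model-fitting variability, at extra cost):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_se(y_true, scores, n_boot=2000, seed=0):
    """Bootstrap standard error of the AUC: resample cases with replacement
    and recompute the AUC on each resample."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:   # need both classes to define an AUC
            continue
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    return np.std(aucs, ddof=1)
```

The bootstrap distribution of the resampled AUCs also gives you percentile confidence intervals essentially for free.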

My hesitation is that although AUC might be considered a "neutral" evaluation metric, classifiers are typically used in a situation where there are different costs to false-positive and false-negative determinations. It's not clear that you would necessarily get the same result by "maximizing AUC directly" as you would by evaluating the cost-benefit tradeoffs in the context of how you plan to use your results. And your use of the word "maximizing" suggests that you might be trying to use AUC to compare among models, which might be better done with a different measure like the Akaike Information Criterion; see the accepted answer on this page and its comments.
