Solved – Appropriate way to get Cross Validated AUC

Tags: auc, cross-validation, mathematical-statistics

I was thinking about cross-validation and what the most appropriate way to do it is…

Let's take the case of binary logistic regression where the goal is to calculate the AUC.

Partition the data into k folds. Which is the correct way to get the cross-validated AUC:

1) Train the model using k-1 folds and predict on the kth fold. Calculate the AUC and repeat until every fold has served as the test set. At the end this gives k AUC values, which are averaged to get the cross-validated AUC.

2) Train the model using k-1 folds and predict on the kth fold. Save the predictions. Repeat until every fold has served as the test set. This gives a vector of predictions, one for each subject in the dataset. Calculate the AUC from this vector of predictions and the vector of observed responses.
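Method (2) can be sketched with scikit-learn as below. The data here is synthetic (a stand-in for a real dataset), and the model and fold count are assumed for illustration: each subject gets exactly one out-of-fold score, and a single AUC is computed from the pooled scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary-classification data (placeholder for the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pooled_scores = np.empty(len(y))  # one predicted score per subject

for train_idx, test_idx in kf.split(X, y):
    # Train on k-1 folds, score the held-out kth fold.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pooled_scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]

# Method (2): a single AUC from the pooled out-of-fold predictions.
cv_auc = roc_auc_score(y, pooled_scores)
print(f"cross-validated AUC = {cv_auc:.3f}")
```

Because `StratifiedKFold` produces disjoint test sets that cover the whole dataset, every entry of `pooled_scores` is filled exactly once.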

My intuition and idea of cross-validation suggest that 2) is the correct one…

Best Answer

As Fawcett explains in 'An Introduction to ROC Analysis', ROC averaging can be done simply by pooling the scores from the test sets $T_1, ..., T_k$, as you suggested in method (2). This is preferred to method (1) because averaging actual ROC curves is hard: the false-positive-rate (x-axis) values of the points generally differ across folds, so a lot of interpolation would be needed to average the curves. Another advantage is that the curve resulting from method (2) is smoother and approximates the AUC better, since a small number of scores tends to underestimate the AUC (at least when it is computed via the trapezoidal rule).

However, one should note that an advantage of method (1) is that it enables you to estimate the variance of the AUC.
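That variance estimate from method (1) can be sketched as follows, again on synthetic data with an assumed model and fold count: compute one AUC per fold, then report the mean together with the spread across folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary-classification data (placeholder for the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_aucs = []
for train_idx, test_idx in kf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

fold_aucs = np.array(fold_aucs)
# Method (1): mean AUC plus a rough estimate of its variability across folds.
print(f"AUC = {fold_aucs.mean():.3f} +/- {fold_aucs.std(ddof=1):.3f}")
```

Note that the fold AUCs are not fully independent (the training sets overlap), so this standard deviation is only a rough indication of variability, not an exact standard error.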
