How to Use AUC via Cross-Validation for Selecting the Optimal Model

Tags: auc, cross-validation, glmnet, regularization

I'm fitting a logit classifier with LASSO and cross-validation, and struggling to select the optimal model using AUC instead of a more usual loss like binomial deviance or classification error. I add more detail on why I think this is better for the business problem at the bottom of the post.

In my results to the customer I plan to use the classifier selected with binomial deviance, but I'd like to get a deeper understanding of why I cannot do it with AUC. For the graph below I'm using an open dataset, and I could provide the code to reproduce it if needed. This is my curve when using AUC in glmnet as the loss for CV.

[Figure: cross-validated AUC curve over the lambda path]
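In case it helps, here is a minimal sketch of how a curve like this is produced; `x` and `y` below stand in for a hypothetical predictor matrix and binary response, they are not from my actual data.

    library(glmnet)

    set.seed(1)

    ## Cross-validate the lasso-penalized logistic model using AUC as the CV measure
    cv_auc <- cv.glmnet(x, y, family = "binomial", type.measure = "auc")

    ## The same model cross-validated with binomial deviance, for comparison
    cv_dev <- cv.glmnet(x, y, family = "binomial", type.measure = "deviance")

    ## Plot the cross-validated measure against log(lambda)
    plot(cv_auc)
    plot(cv_dev)

    ## Lambda selected by each measure (max AUC / min deviance)
    cv_auc$lambda.min
    cv_dev$lambda.min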

There is a maximum, yes, but it does not look sensible to rely on it to select the lambda for the classifier. If I use binomial deviance instead, the results are fine (I don't post that image so as not to overload the question).

Also, if I try an "easier" classification problem, both AUC and deviance behave well (they actually give similar, though not identical, results).

My question, then, is: what makes a good loss for CV? After the fact, I can intuitively see why AUC would be a less stable measure, with higher variance, but I'm probably suffering from confirmation bias. Is there a formal way to establish this? Could you point me to useful resources?

As a reference I'm using "The Elements of Statistical Learning", although there's little about this in there (however, I believe glmnet is maintained by the same authors, and the library has an option for using AUC when cross-validating a classifier). The keywords "AUC" and "cross-validation" are not very discriminative when searching this site or the web, or at least I have not been able to find much that is useful. I would really appreciate insights/pointers.


More context on the business problem: I'm fitting a classifier for uplift modeling; there is a treatment group and a control group, and the point is to find the best customers to target, taking the available results of the experiment into account.

For evaluating the best model, the best reference that I know of is the Qini score, see Radcliffe, N. J. (2007). Using Control Groups to Target on Predicted Lift: Building and Assessing Uplift Models. Direct Marketing Analytics Journal, Direct Marketing Association, 14-21.

I was writing the CV loop myself to select the optimal classifier according to the Qini score, since this is the best measure of the quality of the results (uplift is relatively uncommon and it is difficult to find these functions available off the shelf). My attempts didn't make much sense, so I fell back to AUC as the loss for CV, since in the end the concept of Qini is somewhat similar to AUC; those results did not make sense either.
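To make concrete the kind of hand-rolled loop I mean, here is a rough sketch: fit the lasso path on each training fold and score the held-out fold with a custom metric (plain AUC here; a Qini-style score would slot into the same `fold_metric` function). Again, `x` and `y` are hypothetical placeholders, with `y` assumed to be coded 0/1.

    library(glmnet)

    k <- 10
    folds <- sample(rep(seq_len(k), length.out = nrow(x)))
    lambdas <- glmnet(x, y, family = "binomial")$lambda

    ## AUC via the rank (Mann-Whitney) formulation; replace with a Qini-type
    ## score to select on that instead
    fold_metric <- function(p, y_true) {
      r <- rank(p)
      n1 <- sum(y_true == 1)
      n0 <- sum(y_true == 0)
      (sum(r[y_true == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }

    scores <- matrix(NA, nrow = k, ncol = length(lambdas))
    for (i in seq_len(k)) {
      train <- folds != i
      fit <- glmnet(x[train, ], y[train], family = "binomial", lambda = lambdas)
      p <- predict(fit, newx = x[!train, ], type = "response")
      scores[i, ] <- apply(p, 2, fold_metric, y_true = y[!train])
    }

    ## Average the per-fold metric along the lambda path and pick the maximum
    cv_mean <- colMeans(scores)
    best_lambda <- lambdas[which.max(cv_mean)]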

I've stated my question above using AUC, since it is a common and well-understood measure, also hoping that if I get a sense of why and when AUC is a good or bad measure, I can generalize to Qini.

Thanks.

Best Answer

The concordance probability ($c$-index; ROC area) is not sensitive enough to be used to compare two models, let alone a whole series of models. It gives too little reward to extreme predictions that are correct, because it uses only the ranks of the predictions, not their absolute values. Classification accuracy is even worse, being an arbitrary, discontinuous, improper accuracy score. For your purpose use a proper accuracy score such as the deviance, pseudo $R^2$, or the Brier score.
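As one concrete illustration, here is a minimal sketch of computing two proper scores from held-out predicted probabilities; `p` is a vector of predicted probabilities and `y` the corresponding 0/1 outcomes on a validation fold (both hypothetical placeholders).

    ## Brier score: mean squared error of the predicted probabilities
    brier <- mean((p - y)^2)

    ## Binomial deviance (log-loss), clipping probabilities away from 0 and 1
    eps <- 1e-15
    p_clip <- pmin(pmax(p, eps), 1 - eps)
    deviance <- -2 * mean(y * log(p_clip) + (1 - y) * log(1 - p_clip))

Either of these can be minimized across the lambda path in the same way AUC was maximized above.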