Machine Learning Classification – Difference Between ROC-AUC and Multiclass AUC (MAUC)

Tags: auc, classification, machine learning, multi-class, roc

I am trying to understand how two metrics relate in a multiclass scenario: ROC-AUC and MAUC (multiclass AUC). Scikit-learn provides an implementation of the ROC-AUC score [1], which can be used for both binary and multiclass problems.

However, some studies and implementations, such as [2], [3], [4] and [5], suggest averaging the class-wise AUCs in the multiclass case (MAUC).

My experiments with these metrics yield different results. Do they then evaluate different quantities?

For clarity, I used the implementation from [4] as well as sklearn's ROC-AUC score. In the sklearn case, I set the parameters as:

roc_auc = metrics.roc_auc_score(y_test, ypred, average='weighted',
                                multi_class='ovo', labels=labels)
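For reference, the class-wise averaging described in [2]–[5] boils down to the Hand & Till (2001) M measure. The following is only my own rough sketch of that computation, not the exact code from [4] or [5]; mauc, y_true, proba and labels are names I chose here:

import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def mauc(y_true, proba, labels):
    # Rough sketch of the Hand & Till M measure (MAUC), my own paraphrase.
    # proba is the predict_proba output, with columns ordered as in labels.
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    pair_aucs = []
    for i, j in combinations(range(len(labels)), 2):
        mask = np.isin(y_true, [labels[i], labels[j]])   # keep only classes i and j
        is_i = (y_true[mask] == labels[i]).astype(int)
        a_ij = roc_auc_score(is_i, proba[mask, i])       # A(i|j): rank class i against j
        a_ji = roc_auc_score(1 - is_i, proba[mask, j])   # A(j|i): rank class j against i
        pair_aucs.append((a_ij + a_ji) / 2)
    return float(np.mean(pair_aucs))                     # unweighted mean over class pairs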

With a Random Forest classifier, we obtained:

ROC-AUC:  0.58 # sklearn's roc-auc-score
MAUC:     0.69

This is a difference of more than 0.1 in AUC, so the two values are not close at all.

EDIT

References:

[1] sklearn's roc_auc_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

[2] Tanha, J., Abdi, Y., Samadi, N., Razzaghi, N., Asadpour, M.: Boosting methods for multi-class imbalanced data classification: an experimental review. Journal of Big Data 7(1), 1–47 (2020)

[3] Wang, R., Tang, K.: An empirical study of MAUC in multi-class problems with uncertain cost matrices. CoRR abs/1209.1800 (2012), http://arxiv.org/abs/1209.1800

[4] https://gist.github.com/stulacy/672114792371dc13b247

[5] https://github.com/pritomsaha/Multiclass_AUC/blob/master/multiclass_auc.ipynb

Best Answer

The sklearn implementation offers different options for multi_class and average, which explains the main difference: the Hand & Till paper and the implementations you linked use a one-vs-one approach, as you do in your sklearn call, but they also use a macro average, whereas you specified a weighted average.
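To illustrate how much that choice alone can matter, here is a small sketch on synthetic, imbalanced data (the dataset and model settings are placeholders I made up, not your experiment):

# Compare weighted vs. macro one-vs-one AUC on an imbalanced 3-class problem.
# The ovo/macro combination corresponds to the Hand & Till (2001) M measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)          # class membership probabilities

auc_weighted = roc_auc_score(y_test, proba, multi_class='ovo', average='weighted')
auc_macro = roc_auc_score(y_test, proba, multi_class='ovo', average='macro')
print(f"ovo / weighted: {auc_weighted:.3f}")
print(f"ovo / macro:    {auc_macro:.3f}")  # the Hand & Till style average

The gap between the two numbers grows with the class imbalance, since the weighted average gives the prevalent classes more say.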

There's another issue that prevents the scores from agreeing completely, though it is more minor than the averaging one. The gist and GitHub implementations [4, 5] both sort the samples by predicted probability, and ties are left to whatever order numpy's sort returns. In sklearn, however, tied probabilities produce a sloped segment in the ROC curve, which also affects the area computation. Tweaking the toy example in [5] (where gunes pointed out the difference, even though the classes are balanced there and the averaging plays no role) so that no probability scores are tied yields equal scores.
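Here is a tiny binary example of the tie effect, with made-up scores; the sort-based function is my own illustration of the mechanism, not the code from [4] or [5]:

# sklearn's roc_auc_score counts a tied positive-negative pair as 1/2
# (a sloped ROC segment); a plain sort-based rank AUC silently resolves
# the tie by whatever order the sort returns.
import numpy as np
from sklearn.metrics import roc_auc_score

def sort_based_auc(y_true, scores):
    # Mann-Whitney AUC from a single argsort; ties are NOT averaged.
    # (kind='stable' just makes this demonstration deterministic.)
    order = np.argsort(scores, kind='stable')
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1, 1])
scores = np.array([0.5, 0.5, 0.5, 0.8, 0.9])     # one positive tied with two negatives

print("sklearn   :", roc_auc_score(y, scores))   # 0.833..., ties counted as 1/2
print("sort-based:", sort_based_auc(y, scores))  # 1.0, the tied positive "wins" both

With no ties in the scores, both computations count exactly the same set of correctly ordered positive-negative pairs and return the same value.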
