Machine Learning Classification – Difference Between ROC-AUC and Multiclass AUC (MAUC)

Tags: auc, classification, machine learning, multi-class, roc

I am trying to understand how two metrics relate in a multiclass scenario: ROC-AUC and MAUC (multiclass AUC). Scikit-learn provides an implementation of the ROC-AUC score [1], which can be used for both binary and multiclass problems.

However, some studies and implementations, such as [2], [3], [4] and [5], suggest averaging the class-wise AUCs in the multiclass case (MAUC).

My experiments with these metrics yield different results. Do they then evaluate different quantities?

For clarity, I used the implementation from [4] as well as sklearn's ROC-AUC score. In the sklearn case, I set the parameters as:

roc_auc = metrics.roc_auc_score(y_test, ypred, average='weighted',
                                multi_class='ovo', labels=labels)
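For reference, the class-wise averaging described in [2]–[5] boils down to the Hand & Till (2001) M measure. The following is only my own rough sketch of that computation, not the exact code from [4] or [5]; mauc, y_true, proba and labels are names I chose here:

import numpy as np
from itertools import combinations
from sklearn.metrics import roc_auc_score

def mauc(y_true, proba, labels):
    # Rough sketch of the Hand & Till M measure (MAUC), my own paraphrase.
    # proba is the predict_proba output, with columns ordered as in labels.
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    pair_aucs = []
    for i, j in combinations(range(len(labels)), 2):
        mask = np.isin(y_true, [labels[i], labels[j]])   # keep only classes i and j
        is_i = (y_true[mask] == labels[i]).astype(int)
        a_ij = roc_auc_score(is_i, proba[mask, i])       # A(i|j): rank class i against j
        a_ji = roc_auc_score(1 - is_i, proba[mask, j])   # A(j|i): rank class j against i
        pair_aucs.append((a_ij + a_ji) / 2)
    return float(np.mean(pair_aucs))                     # unweighted mean over class pairs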

With a Random Forest classifier, we obtained:

ROC-AUC:  0.58 # sklearn's roc-auc-score
MAUC:     0.69

This is a difference of more than 0.1 in AUC, so the two values are not close at all.

EDIT

References:

[1] sklearn's roc_auc_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

[2] Tanha, J., Abdi, Y., Samadi, N., Razzaghi, N., Asadpour, M.: Boosting methods for multi-class imbalanced data classification: an experimental review. Journal of Big Data 7(1), 1–47 (2020)

[3] Wang, R., Tang, K.: An empirical study of MAUC in multi-class problems with uncertain cost matrices. CoRR abs/1209.1800 (2012), http://arxiv.org/abs/1209.1800

[4] https://gist.github.com/stulacy/672114792371dc13b247

[5] https://github.com/pritomsaha/Multiclass_AUC/blob/master/multiclass_auc.ipynb

Best Answer

The sklearn implementation offers different options for multi_class and average, which explains the main difference: the Hand & Till paper and the implementations you linked use a one-vs-one approach, as you do in your sklearn call, but they also use a macro average, whereas you specified a weighted average.
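To illustrate how much that choice alone can matter, here is a small sketch on synthetic, imbalanced data (the dataset and model settings are placeholders I made up, not your experiment):

# Compare weighted vs. macro one-vs-one AUC on an imbalanced 3-class problem.
# The ovo/macro combination corresponds to the Hand & Till (2001) M measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)          # class membership probabilities

auc_weighted = roc_auc_score(y_test, proba, multi_class='ovo', average='weighted')
auc_macro = roc_auc_score(y_test, proba, multi_class='ovo', average='macro')
print(f"ovo / weighted: {auc_weighted:.3f}")
print(f"ovo / macro:    {auc_macro:.3f}")  # the Hand & Till style average

The gap between the two numbers grows with the class imbalance, since the weighted average gives the prevalent classes more say.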

There's another issue that prevents the scores from agreeing completely, though it is more minor than the averaging one. The gist and GitHub implementations [4, 5] both sort the samples by predicted probability, and ties are left to whatever order numpy's sort returns. In sklearn, however, tied probabilities produce a sloped segment in the ROC curve, which also affects the area computation. Tweaking the toy example in [5] (where gunes pointed out the difference, even though the classes are balanced there and the averaging plays no role) so that no probability scores are tied yields equal scores.
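Here is a tiny binary example of the tie effect, with made-up scores; the sort-based function is my own illustration of the mechanism, not the code from [4] or [5]:

# sklearn's roc_auc_score counts a tied positive-negative pair as 1/2
# (a sloped ROC segment); a plain sort-based rank AUC silently resolves
# the tie by whatever order the sort returns.
import numpy as np
from sklearn.metrics import roc_auc_score

def sort_based_auc(y_true, scores):
    # Mann-Whitney AUC from a single argsort; ties are NOT averaged.
    # (kind='stable' just makes this demonstration deterministic.)
    order = np.argsort(scores, kind='stable')
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

y = np.array([0, 0, 1, 1, 1])
scores = np.array([0.5, 0.5, 0.5, 0.8, 0.9])     # one positive tied with two negatives

print("sklearn   :", roc_auc_score(y, scores))   # 0.833..., ties counted as 1/2
print("sort-based:", sort_based_auc(y, scores))  # 1.0, the tied positive "wins" both

With no ties in the scores, both computations count exactly the same set of correctly ordered positive-negative pairs and return the same value.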
