I have a dataset with 5 classes. About 98% of the dataset belongs to class 5, and classes 1-4 share the remaining ~2% roughly equally. However, it is highly important that classes 1-4 are correctly classified.
Accuracy is not a good measure of performance for my task. I found lots of information on metrics for imbalanced binary classification tasks, but not much on multiclass problems.
Which performance metrics should I use for such a task?
- TP, TN, FP, FN
- Precision
- Sensitivity
- Specificity
- F-score
- ROC-AUC (micro, macro, samples, weighted)
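Several of the metrics in that list (TP/TN/FP/FN, precision, sensitivity, specificity) can be derived per class from the multiclass confusion matrix. A minimal sketch with scikit-learn, using made-up toy labels for a 5-class problem (the data here is an assumption, purely for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a 5-class problem; class 5 dominates.
y_true = np.array([1, 2, 3, 4, 5, 5, 5, 5, 5, 5])
y_pred = np.array([1, 2, 5, 4, 5, 5, 5, 5, 5, 5])

labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Derive per-class TP/FP/FN/TN from the multiclass confusion matrix
# by treating each class as "one vs. rest".
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)

precision   = tp / np.maximum(tp + fp, 1)   # guard against division by zero
sensitivity = tp / np.maximum(tp + fn, 1)   # a.k.a. recall
specificity = tn / np.maximum(tn + fp, 1)

for cls, p, se, sp in zip(labels, precision, sensitivity, specificity):
    print(f"class_{cls}: precision={p:.2f} sensitivity={se:.2f} specificity={sp:.2f}")
```

Note that specificity has no direct scikit-learn function for the multiclass case, which is why it is computed from the confusion matrix here.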
Best Answer
For unbalanced classes, I would suggest going with the weighted F1-score, or macro-averaged/weighted AUC.
Let's first look at the F1-score for binary classification.
The F1-score is the harmonic mean of precision and recall, so it gives more weight to the lower of the two numbers.
For example, with a precision of 1.0 and a recall of 0.01, the F1-score is about 0.02, far below the arithmetic mean of 0.5.
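A quick numeric check of that harmonic-mean behaviour, using a small made-up binary example (the labels are an assumption for illustration):

```python
from sklearn.metrics import f1_score

# Binary toy example: one of the two positives is found.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

precision = 1.0   # the single positive prediction is correct
recall = 0.5      # only one of two actual positives was found

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
print(f"F1 = {f1:.3f}")  # 0.667 -- closer to the lower number (0.5) than to 1.0
```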
Now, let's move to multiclass classification.
Suppose we have five classes, class_1, class_2, class_3, class_4, class_5,
and, from the model's predictions, we compute the confusion-matrix results for each class.
Formula for precision for each class:
Precision = (True Positives for the class) / (Count of predicted positives for that class)
e.g. precision for class_1 = (True Positives for class_1) / (Count predicted as class_1)
Formula for recall for each class:
Recall = (True Positives for the class) / (Actual positives for that class)
e.g. recall for class_1 = (True Positives for class_1) / (Total instances of class_1)
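These per-class formulas correspond to scikit-learn's `precision_score` and `recall_score` with `average=None`, which returns one score per class. A sketch with assumed toy labels:

```python
from sklearn.metrics import precision_score, recall_score

# Toy predictions (assumed data) for classes 1-5.
y_true = [1, 1, 2, 3, 4, 5, 5, 5]
y_pred = [1, 2, 2, 3, 5, 5, 5, 5]

labels = [1, 2, 3, 4, 5]
# average=None returns one score per class, matching the formulas above.
# zero_division=0 handles classes that were never predicted (here, class_4).
per_class_precision = precision_score(
    y_true, y_pred, labels=labels, average=None, zero_division=0)
per_class_recall = recall_score(
    y_true, y_pred, labels=labels, average=None, zero_division=0)

print("precision per class:", per_class_precision)
print("recall per class:   ", per_class_recall)
```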
Formula for F1: F1 is the harmonic mean of precision and recall, i.e.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Problem with the macro calculation: when computing macro-F1, we give equal weight to each class, regardless of how many samples it has.
Weighted F1-score:
We don't have to do that: in the weighted-average F1-score, or weighted-F1, we weight the F1-score of each class by the number of samples from that class (its support).
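The difference between the two averages is easy to see on a class distribution like the one in the question. A sketch with assumed toy labels mirroring the ~98% / ~2% split, where two of the four rare classes are misclassified:

```python
from sklearn.metrics import f1_score

# Assumed toy data: classes 1-4 are rare, class 5 dominates.
y_true = [1, 2, 3, 4] + [5] * 16
y_pred = [1, 2, 5, 5] + [5] * 16   # classes 3 and 4 entirely missed

# Macro: every class counts equally, so the two missed rare classes
# drag the score down hard.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Weighted: each class's F1 is weighted by its support, so the dominant
# class masks the failures on the rare classes.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"macro F1:    {macro_f1:.3f}")    # 0.588
print(f"weighted F1: {weighted_f1:.3f}") # 0.853
```

Since the question says classes 1-4 are the important ones despite being rare, it is worth noting that weighted-F1 can hide exactly those failures; comparing both averages (or inspecting per-class F1 directly) is a reasonable sanity check.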
References: https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1