Introduction
Many practical applications have only positive and unlabeled data (aka PU learning), which complicates both building and evaluating classifiers. Evaluating a classifier using only positive and unlabeled data is a tricky task, and can only be done by making assumptions, which may or may not be reasonable for a given real problem.
Shameless self-advertisement: For a detailed overview, I suggest reading my paper on the subject.
I will describe the main effects of the PU learning setting on performance metrics that are based on contingency tables. A contingency table relates the predicted labels to the true labels:
+---------------------+---------------------+---------------------+
| | positive true label | negative true label |
+---------------------+---------------------+---------------------+
| positive prediction | true positive | false positive |
| negative prediction | false negative | true negative |
+---------------------+---------------------+---------------------+
The problem in PU learning is that we don't know the true labels, which affects all cells in the contingency table (not just the last column!). It is impossible to make claims about the effect of the PU learning setting on performance metrics without making additional assumptions. For example, if your known positives are a biased sample, you can't make any reliable inference (and such bias is common!).
Treating the unlabeled set as negative
A common simplification used in PU learning is to treat the unlabeled set as if it is negative, and then compute metrics as if the problem were fully supervised. Sometimes this is good enough, but in many cases it is detrimental. I highly recommend against it.
Effect on precision. Say we want to compute precision:
$$p = \frac{TP}{TP + FP}.$$
Now suppose we have a classifier that would be perfect if we knew the true labels (i.e., no false positives, $p=1$). In the PU learning setting, using the approximation that the unlabeled set is negative, only a fraction of the (in reality) true positives is marked as such, while the rest are considered false positives, immediately yielding $\hat{p} < 1$. Obviously this is wrong, but it gets worse: the estimation error can be arbitrarily large, depending on the fraction of known positives over latent positives. Suppose only 1% of positives are known and the rest are in the unlabeled set; then (still with a perfect classifier) we would get $\hat{p} = 0.01$ ... Yuck!
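To make this concrete, write $P$ for the number of latent positives and $c$ for the fraction of them that are labeled. With a perfect classifier all $P$ positives are predicted positive, but only $cP$ of them count as true positives under this approximation, while the remaining $(1-c)P$ count as false positives:
$$\hat{p} = \frac{cP}{cP + (1-c)P} = c.$$
With $c = 0.01$ this gives exactly the $\hat{p} = 0.01$ from above.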
Effect on other metrics:
- True Positives: underestimated
- True Negatives: overestimated
- False Positives: overestimated
- False Negatives: underestimated
- Accuracy: can be over- or underestimated, depending on the class balance and on how the classifier scores the latent positives
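A quick simulation illustrates these directions; the numbers and the classifier below are purely hypothetical, chosen only to make the effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 latent positives, 9000 negatives; only 10% of positives are labeled.
n_pos, n_neg, label_frac = 1000, 9000, 0.10
y_true = np.r_[np.ones(n_pos), np.zeros(n_neg)]                        # latent ground truth
labeled = np.r_[rng.random(n_pos) < label_frac, np.zeros(n_neg, bool)]
y_pu = labeled.astype(float)                                           # PU view: unlabeled treated as negative

# A hypothetical, reasonably good classifier: catches most positives, few false alarms.
y_pred = np.r_[rng.random(n_pos) < 0.9, rng.random(n_neg) < 0.05].astype(float)

def contingency(y, pred):
    tp = np.sum((pred == 1) & (y == 1)); fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1)); tn = np.sum((pred == 0) & (y == 0))
    return tp, fp, fn, tn

print("true labels (TP, FP, FN, TN):", contingency(y_true, y_pred))
print("PU labels   (TP, FP, FN, TN):", contingency(y_pu, y_pred))      # TP/FN down, FP/TN up
```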
For AUC, sensitivity and specificity, I recommend reading the paper, as describing them in sufficient detail here would take us too far.
Start from the rank distribution of known positives
A reasonable assumption is that the known positives are a representative subset of all positives (e.g., they are a random, unbiased sample). Under this assumption, the distribution of decision values of known positives can be used as a proxy for the distribution of decision values of all positives (and hence also their associated ranks). This assumption enables us to compute strict bounds on all entries of the contingency table, which in turn translate into (guaranteed!) bounds on all derived performance metrics.
A crucial observation we've made is that, in the PU learning context and under the assumption mentioned above, the bounds on most performance metrics are a function of the fraction of positives in the unlabeled set ($\beta$). We have shown that computing (bounds on) performance metrics without an estimate of $\beta$ is essentially impossible, as the bounds are then no longer strict.
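To make the role of $\beta$ tangible, here is a minimal sketch of the underlying idea as a point estimate (not the strict bounds derived in the paper), assuming the known positives are representative and that an estimate of $\beta$ is supplied by the user:

```python
import numpy as np

def pu_contingency_estimate(scores, labeled_pos, beta, threshold):
    """Rough point estimates of (TP, FP, FN, TN) in a PU setting.

    scores      : decision values for all instances
    labeled_pos : boolean mask marking the known (labeled) positives
    beta        : assumed fraction of positives among the unlabeled instances
    threshold   : instances with score >= threshold are predicted positive
    """
    scores = np.asarray(scores, dtype=float)
    labeled_pos = np.asarray(labeled_pos, dtype=bool)
    unlabeled = ~labeled_pos
    pred_pos = scores >= threshold

    # Fraction of known positives predicted positive; under the representativeness
    # assumption this is also a proxy for the latent positives in the unlabeled set.
    tpr_proxy = pred_pos[labeled_pos].mean()
    n_latent_pos = beta * unlabeled.sum()

    tp = pred_pos[labeled_pos].sum() + tpr_proxy * n_latent_pos
    fn = (~pred_pos)[labeled_pos].sum() + (1 - tpr_proxy) * n_latent_pos
    fp = pred_pos[unlabeled].sum() - tpr_proxy * n_latent_pos
    tn = (~pred_pos)[unlabeled].sum() - (1 - tpr_proxy) * n_latent_pos
    return tp, fp, fn, tn
```

Any contingency-table metric (precision, recall, F1, ...) can then be computed from these estimates; the paper replaces this single point estimate with guaranteed lower and upper bounds, which require an estimate of $\beta$ to remain strict.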
For unbalanced classes, I would suggest going with the weighted F1-score or the average AUC / weighted AUC.
Let's first look at the F1-score for binary classification.
The F1-score gives a larger weight to lower numbers.
For example,
- when precision is 100% and recall is 0%, the F1-score will be 0%, not 50%.
- say we have Classifier A with precision = recall = 80%, and Classifier B with precision = 60% and recall = 100%. Arithmetically, the mean of precision and recall is the same for both models, but when we use F1's harmonic mean formula, the score for Classifier A will be 80% while for Classifier B it will be only 75%: Classifier B's low precision pulls down its F1-score.
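A two-line check of those numbers:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.8))  # Classifier A -> 0.80
print(f1(0.6, 1.0))  # Classifier B -> 0.75
```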
Now, let's move on to multiclass classification.
Let us suppose we have five classes, class_1, class_2, class_3, class_4, class_5, and we want to compute the following per-class metrics from the model's results.
Formula for precision for each class = (true positives for that class)/(count of predicted positives for that class)
e.g. precision for class_1 = (true positives for class_1)/(count of instances predicted as class_1)
Formula for recall for each class = (true positives for that class)/(actual positives for that class)
e.g. recall for class_1 = (true positives for class_1)/(total instances of class_1)
Formula for F1: F1 is the harmonic mean of precision and recall, i.e.
F1 = 2*(Precision*Recall)/(Precision+Recall)
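As a sketch of these per-class formulas, take a hypothetical 5x5 confusion matrix (rows = true class, columns = predicted class):

```python
import numpy as np

# Hypothetical 5x5 confusion matrix: rows are true classes, columns are predictions.
cm = np.array([
    [50,  2,  1,  0,  2],
    [ 3, 30,  4,  1,  2],
    [ 1,  2, 20,  2,  0],
    [ 0,  1,  3, 10,  1],
    [ 2,  0,  1,  1,  6],
])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)   # TP / count of predicted positives per class
recall    = tp / cm.sum(axis=1)   # TP / actual positives per class
f1        = 2 * precision * recall / (precision + recall)

for i, (p, r, f) in enumerate(zip(precision, recall, f1), start=1):
    print(f"class_{i}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```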
Macro-F1 = (Class_1_F1 + Class_2_F1 + Class_3_F1 + Class_4_F1 + Class_5_F1)/5
Macro-Precision = (Class_1_Precision + Class_2_Precision + Class_3_Precision + Class_4_Precision + Class_5_Precision)/5
Macro-Recall = (Class_1_Recall + Class_2_Recall + Class_3_Recall + Class_4_Recall + Class_5_Recall)/5
Problem with the macro calculation: when averaging to get the macro-F1, we give equal weight to each class, no matter how many samples it has.
Weighted F1 Score:
We don’t have to do that: in weighted-average F1-score, or weighted-F1,
we weight the F1-score of each class by the number of samples from that class.
Weighted F1 Score = (N1*Class_1_F1 + N2*Class_2_F1 + N3*Class_3_F1 + N4*Class_4_F1 + N5*Class_5_F1)/(N1 + N2 + N3 + N4 + N5)
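To see the difference on a small, hypothetical imbalanced example, scikit-learn's `f1_score` supports both averaging schemes directly:

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced data: class 0 dominates and is predicted well,
# the rare classes 1 and 2 are predicted poorly.
y_true = [0]*90 + [1]*5 + [2]*5
y_pred = [0]*88 + [1, 2] + [0]*4 + [1] + [0]*4 + [2]

print(f1_score(y_true, y_pred, average="macro"))     # equal weight for every class
print(f1_score(y_true, y_pred, average="weighted"))  # each class weighted by its sample count
```

The weighted score ends up close to the F1 of the dominant class, while the macro score is dragged down by the rare classes.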
References: https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1
Best Answer
Have a look at the Matthews Correlation Coefficient
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{ (TP + FP)(TP + FN)(TN + FP)(TN + FN) }}$$
I have seen it used pretty often as a performance metric in the classification of SNP datasets. Have a look at this link as well; they discuss the difference between AUC and MCC.
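A quick sketch of the formula on hypothetical counts (scikit-learn's `matthews_corrcoef` gives the same value from raw labels):

```python
import math
from sklearn.metrics import matthews_corrcoef

# Hypothetical contingency-table counts.
tp, tn, fp, fn = 90, 85, 15, 10

mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(mcc)

# Same computation from label vectors.
y_true = [1]*(tp + fn) + [0]*(tn + fp)
y_pred = [1]*tp + [0]*fn + [0]*tn + [1]*fp
print(matthews_corrcoef(y_true, y_pred))
```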
Otherwise you can just compute an average accuracy (the complement of the average error rate); I have seen people use it in multiclass problems as well.
$$AAcc = \frac{1}{2} \bigg( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \bigg) $$
It is usually used in authentication systems in the form of the half total error rate (HTER). For example, here they provide a statistical test for it.
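This average accuracy is what scikit-learn exposes as balanced accuracy; a small sketch on hypothetical counts:

```python
from sklearn.metrics import balanced_accuracy_score

# Hypothetical counts: sensitivity = TP/(TP+FN), specificity = TN/(TN+FP).
tp, fn, tn, fp = 90, 10, 40, 60

aacc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
print(aacc)                                      # 0.5 * (0.9 + 0.4) = 0.65

y_true = [1]*(tp + fn) + [0]*(tn + fp)
y_pred = [1]*tp + [0]*fn + [0]*tn + [1]*fp
print(balanced_accuracy_score(y_true, y_pred))   # same value

# The half total error rate is simply its complement: HTER = 1 - AAcc.
```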