Solved – geometric mean for binary classification doesn’t use sensitivity of each class

accuracy, classification, scikit-learn, unbalanced-classes

scikit-learn's contrib package, imbalanced-learn, has a function, geometric_mean_score(), which has the following in its documentation:

The geometric mean (G-mean) is the root of the product of class-wise
sensitivity. This measure tries to maximize the accuracy on each of
the classes while keeping these accuracies balanced. For binary
classification G-mean is the squared root of the product of the
sensitivity and specificity. For multi-class problems it is a higher
root of the product of sensitivity for each class.

Why are sensitivity and specificity used for binary classification? In the sources below, the geometric mean is defined as the geometric mean of precision and recall.

Cross Validated answer

g-mean is defined as $g = \sqrt{\text{Precision} \cdot \text{Recall}}$

Towards DS: Beyond Accuracy

There are other metrics for combining precision and recall, such as the Geometric Mean of precision and recall, but the F1 score is the most commonly used.

Best Answer

"G-mean" in itself does not refer to something other than the result of: $g=\sqrt{x\cdot y}$ when dealing with two variables $x$ and $y$. Therefore, unless formally defined I would be careful to interpreter what a particular author refers at.

That said, imbalanced-learn's geometric_mean_score() performs the right calculation given the reference it uses. Kubat & Matwin (1997), Addressing the curse of imbalanced training sets: one-sided selection, define the geometric mean $g$ in terms of the "accuracy on positive examples" and the "accuracy on negative examples", which correspond respectively to Sensitivity (True Positive Rate, TPR) and Specificity (True Negative Rate, TNR). The geometric_mean_score() function is therefore correct: it reproduces the methodology presented in the references it cites.
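As a quick check, here is a minimal sketch (the toy labels are made up for illustration) showing that geometric_mean_score() reproduces $\sqrt{\text{Sensitivity} \cdot \text{Specificity}}$ rather than $\sqrt{\text{Precision} \cdot \text{Recall}}$:

```python
# Minimal sketch (toy labels made up for illustration): geometric_mean_score()
# matches sqrt(Sensitivity * Specificity), not sqrt(Precision * Recall).
import numpy as np
from sklearn.metrics import precision_score, recall_score
from imblearn.metrics import geometric_mean_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # TPR: recall of the positive class
specificity = recall_score(y_true, y_pred, pos_label=0)  # TNR: recall of the negative class
precision = precision_score(y_true, y_pred, pos_label=1)

print(geometric_mean_score(y_true, y_pred))   # ~0.645
print(np.sqrt(sensitivity * specificity))     # ~0.645, same as above
print(np.sqrt(precision * sensitivity))       # ~0.577, a different quantity
```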

Sensitivity and Specificity are informative metrics on how likely we are to detect instances of the Positive and Negative class, respectively, in our hold-out test sample. In that sense, Specificity is essentially the Sensitivity of detecting Negative-class examples. This becomes clearer when looking at the multi-class version of the G-mean, where we compute the $n$-th root of the product of the Sensitivity of each class. In the case where $n=2$, assuming we have classes A and B with class A as the "Positive" one and class B as the "Negative" one, the Sensitivity of class B is just the Specificity of the binary classification. In the case where $n>2$, we cannot refer to a "Positive" and a "Negative" class (outside the context of one-vs-rest classification), so we simply take the $n$-th root of the product of the per-class Sensitivity scores, i.e. $\sqrt[n]{x_1 \cdot x_2 \cdot \dots \cdot x_n}$, where $x_i$ refers to the Recall score of the $i$-th class.
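To illustrate the multi-class case, a short sketch (again with made-up labels) computing the $n$-th root of the product of per-class Recall by hand and comparing it against imbalanced-learn's score:

```python
# Sketch (toy multi-class labels, made up for illustration): the G-mean as the
# n-th root of the product of per-class Recall (i.e. per-class Sensitivity).
import numpy as np
from sklearn.metrics import recall_score
from imblearn.metrics import geometric_mean_score

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 0, 1, 1, 2, 2, 0, 2, 2])

per_class_recall = recall_score(y_true, y_pred, average=None)  # one Recall per class
n = len(per_class_recall)

print(per_class_recall.prod() ** (1.0 / n))   # n-th root of the product, by hand
print(geometric_mean_score(y_true, y_pred))   # the same value from imbalanced-learn
```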

Let me stress that Sensitivity and Specificity are metrics that dichotomise our output and should, in the first instance, be avoided when optimising classifier performance. A more detailed discussion of why metrics like Sensitivity and Accuracy, which inherently dichotomise our outputs, are often suboptimal can be found here: Why is accuracy not the best measure for assessing classification models?

Further commentary: I think some of the confusion about how this "g-mean" is defined stems from the fact that the $F_1$ score is defined in terms of Precision (Positive Predictive Value, PPV) and Recall (TPR) and is the harmonic mean ($h = \frac{2 \cdot x \cdot y}{x+y}$) of the two. Some people might use the geometric mean $g$ instead of the harmonic mean $h$, thinking it is just another reformulation, without realising that they are redefining an existing metric. Please note that the geometric mean of Precision and Recall is not inherently wrong; it is simply not what F-scores refer to, nor what the papers cited by imbalanced-learn use. A small sketch contrasting the two means follows below.
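For completeness, here is that sketch (toy labels made up, chosen so that Precision and Recall differ) contrasting the harmonic mean, which is what the $F_1$ score actually is, with the geometric mean of the same two quantities:

```python
# Sketch (toy labels made up for illustration): the F1 score is the harmonic
# mean of Precision and Recall; their geometric mean is a different number.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

p = precision_score(y_true, y_pred)  # ~0.667
r = recall_score(y_true, y_pred)     # 0.5

print(f1_score(y_true, y_pred))      # ~0.571
print(2 * p * r / (p + r))           # harmonic mean: identical to f1_score
print(np.sqrt(p * r))                # geometric mean: ~0.577, not the F1 score
```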
