Solved – Classifier performance measure that combines sensitivity and specificity

classificationmodel-evaluationrocsensitivity-specificity

I have 2-classes labelled data on which I'm performing classification using multiple classifiers. And the datasets are well balanced. When assessing the classifiers' performance, I need to take into consideration how accurate the classifier is in determining not only the true positives, but the true negatives also. Therefore, if I use accuracy, and if the classifier is biased toward positives and classifies everything as positive, I will get around 50% accuracy, even though it failed at classifying any true negatives. This property is extended to precision and recall as they focus on one class only, and in turn to F1-score. (This is what I understand even from this paper for example "Beyond Accuracy, F-score and ROC: a Family of Discriminant Measures for Performance Evaluation").

Therefore, I can use sensitivity and specificity (TPR and TNR) to see how the classifier performed for each class, where I aim to maximize these values.

My question is that I am looking for a measure that combines both these values into one meaningful measure. I looked into the measures provided in that paper, but I found it to be non-trivial. And based on my understanding I was wondering why can't we apply something like the F-score, but instead of using precision and recall I would use sensitivity and specificity? So the formula would be
$$
\text{my Performance Measure} = \frac{2 * \text{sensitivity} * \text{specificity}}{\text{sensitivity} + \text{specificity}}
$$
and my aim would be to maximize this measure. I find it to be very representative. Is there a similar formula already? And would this make sense or is it even mathematically sound?

Best Answer

I would say that there might not be any particular or only one measure which you should take into account.

Last time when I did probabilistic classification I had a R package ROCR and explicit cost values for the False Positives and False Negatives.

I considered all cutoff-points from 0 to 1 and used many measures such as expected cost when selecting this cutoff - point. Of course I had already AUC measure for the general measure of classifying accuracy. But for me this was not the only possibility.

Values for the FP and FN cases must come outside your particular model, maybe these are provided by some subject matter expert?

For example in customer churn analysis it might be more expensive to incorrectly infer that customer is not churning but also that it will be expensive to give a general reduction in prices for services without accurary to target these to correct groups.

-Analyst

Introduction

Many practical applications have only positive and unlabeled data (aka PU learning), which poses problems in building and evaluating classifiers. Evaluating classifiers using only positive and unlabeled data is a tricky task, and can only be done by making some assumptions, which may or may not be reasonable for a real problem.

Shameless self-advertisement: For a detailed overview, I suggest reading my paper on the subject.

I will describe the main effects of the PU learning setting on performance metrics that are based on contingency tables. A contingency table relates the predicted labels to the true labels:

+---------------------+---------------------+---------------------+
|                     | positive true label | negative true label |
+---------------------+---------------------+---------------------+
| positive prediction | true positive       | false positive      |
| negative prediction | false negative      | true negative       |
+---------------------+---------------------+---------------------+

The problem in PU learning is that we don't know the true labels, which affects all cells in the contingency table (not just the last column!). It is impossible to make claims about the effect of the PU learning setting on performance metrics without making additional assumptions. For example, if your known positives are biased you can't make any reliable inference (this is common!).

Treating the unlabeled set as negative

A common simplification used in PU learning is to treat the unlabeled set as if it is negative, and then compute metrics as if the problem is fully supervised. Sometimes this is good enough, but this can be detrimental in many cases. I highly recommend against it.

Effect on precision. Say we want to compute precision:

$$p = \frac{TP}{TP + FP}.$$

Now, suppose we have a perfect classifier if we would know the true labels (i.e., no false positives, $p=1$). In the PU learning setting, using the approximation that the unlabeled set is negative, only a fraction of (in reality) true positives are marked as such, while the rest will be considered false positives, immediately yielding $\hat{p} < 1$. Obviously this is wrong, but it gets worse: the estimation error can be arbitrarily large, depending on the fraction of known positives over latent positives. Suppose only 1% of positives are known, and the rest is in the unlabeled set, then (still with a perfect classifier), we would get $\hat{p} = 0.01$ ... Yuck!

Effect on other metrics:

True Positives: underestimated
True Negatives: overestimated
False Positives: overestimated
False Negatives: underestimated
Accuracy: depends on balance and classifier

For AUC, sensitivity and specificity I recommend reading the paper as describing it in sufficient detail here would take us too far.

Start from the rank distribution of known positives

A reasonable assumption is that the known positives are a representative subset of all positives (e.g., they are a random, unbiased sample). Under this assumption, the distribution of decision values of known positives can be used as a proxy for the distribution of decision values of all positives (and hence also associated ranks). This assumption enables us to compute strict bounds on all entries of the contingency table, which then translates into (guaranteed!) bounds on all derived performance metrics.

A crucial observation we've made is that in the PU learning context under the assumption mentioned above is that the bounds on most performance metrics are a function of the fraction of positives in the unlabeled set ($\beta$). We have shown that computing (bounds on) performance metrics without an estimate of $\beta$ is basically impossible, as the bounds are then no longer strict.

Best Answer

Related Solutions

Solved – compute ROC from Sensitivity and Specificity

Solved – How are performance measures affected in PU learning

Introduction

Treating the unlabeled set as negative

Start from the rank distribution of known positives

Related Question