Solved – Should I use the opposite of an F1 score

precision-recall

The F1 score is commonly defined as the harmonic mean of precision and recall, which is equivalent to:

$$
\text{F1 score} = \frac{2 \times \mathit{TP}}{ 2 \times \mathit{TP} + \mathit{FP} + \mathit{FN} }
$$
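
For concreteness, a minimal sketch of this formula in Python (the confusion-matrix counts are made up, purely for illustration):

```python
def f1_score(tp, fp, fn):
    """F1 score computed directly from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Made-up counts, purely for illustration.
tp, fp, fn = 40, 10, 5
print(f1_score(tp, fp, fn))  # 80 / 95 ≈ 0.842
```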

This is a good measure of how well the algorithm picks out relevant items from a set. However, I am also interested in how good the algorithm is at leaving irrelevant items in the set, and therefore I feel that the number of true negatives should be considered.

To do this, I'm considering calculating an F1 score where positives are treated as negatives and vice versa. I would call this the "opposing F1 score". It would be calculated like this:

$$
\text{opposing F1 score} = \frac{ 2 \times \mathit{TN} }{ 2 \times \mathit{TN} + \mathit{FN} + \mathit{FP} }
$$

I could also calculate the opposing precision and recall values like this:

$$
\text{opposing precision} = \frac{\mathit{TN}}{\mathit{TN} + \mathit{FN}}
$$
$$
\text{opposing recall} = \frac{\mathit{TN}}{\mathit{TN}+\mathit{FP}}
$$

Then I could perhaps train the algorithm to maximize the average (for some definition of average) of the F1 score and the opposing F1 score.
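
For concreteness, a minimal sketch of what I have in mind, again with made-up counts (the arithmetic mean here is just one candidate for the "average"; if I understand correctly, it is what a macro-averaged F1 over the two classes would give):

```python
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def opposing_precision(tn, fn):
    return tn / (tn + fn)

def opposing_recall(tn, fp):
    return tn / (tn + fp)

def opposing_f1(tn, fp, fn):
    return 2 * tn / (2 * tn + fn + fp)

# Made-up counts, purely for illustration.
tp, fp, fn, tn = 40, 10, 5, 100

score = f1(tp, fp, fn)                    # 80 / 95 ≈ 0.842
opposing_score = opposing_f1(tn, fp, fn)  # 200 / 215 ≈ 0.930

# One candidate "average": the unweighted arithmetic mean of the two scores.
combined = (score + opposing_score) / 2
print(combined)
```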

Is anybody else doing this? Are there better ways of achieving my aims?

Context:

The problem I'm working on is not the usual problem for these measures. The task is to create a subset of items where all duplicates are excluded. I am interested both in excluding the greatest number of duplicates and in including the greatest number of non-duplicates. So far, I have been using precision, recall, and the F1 score on pairs of items (so a true positive would be when the algorithm correctly classifies a pair as a duplicate). But as you can see, these measures aren't perfectly suited to my use-case.
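
For concreteness, a rough sketch of how such pairwise counts could be computed, assuming each item carries a true and a predicted duplicate-group label (the labels below are invented):

```python
from itertools import combinations

def pairwise_counts(true_groups, pred_groups):
    """Count pairwise TP/FP/FN/TN: a pair counts as positive when both
    items belong to the same duplicate group."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_groups)), 2):
        actual = true_groups[i] == true_groups[j]
        predicted = pred_groups[i] == pred_groups[j]
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Invented example: five items, true duplicate groups vs. predicted groups.
true_groups = ["a", "a", "b", "c", "c"]
pred_groups = ["a", "a", "b", "b", "c"]
print(pairwise_counts(true_groups, pred_groups))  # (1, 1, 1, 7)
```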

Best Answer

Although the two measures have quite different logic, the F1 score is actually a special case of the specific agreement coefficient. Specifically, F1 is equivalent to specific agreement for the positive class when there are two raters, two categories, and no missing data. The "opposing F1 score" that you propose is equivalent to specific agreement for the negative class in this same scenario.

I encourage you to investigate the specific agreement formulation, as it is more flexible than the F1 score (in that it can handle multiple raters, multiple categories, and missing data) and has a nice interpretation in addition to being the harmonic mean of precision and recall: specific agreement is equal to the probability that a randomly selected rater will assign a given item to a category, given that another randomly selected rater has also assigned that item to that category.
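
As a rough sketch of the two-rater, two-category, no-missing-data case (treating the ground truth as one "rater" and the classifier as the other; the labels below are invented), per-category specific agreement reduces exactly to the F1 score and to the proposed "opposing F1 score":

```python
def specific_agreement(rater1, rater2, category):
    """Specific agreement for one category with two raters and no missing
    data: twice the number of items both raters placed in the category,
    divided by the total number of assignments either rater made to it."""
    both = sum(a == category and b == category for a, b in zip(rater1, rater2))
    n1 = sum(a == category for a in rater1)
    n2 = sum(b == category for b in rater2)
    return 2 * both / (n1 + n2)

# Treat "rater 1" as the ground truth and "rater 2" as the classifier.
truth      = [1, 1, 1, 0, 0, 0, 0, 0]
prediction = [1, 0, 1, 0, 0, 1, 0, 0]

# Positive specific agreement = 2*TP / (2*TP + FP + FN), i.e. the F1 score.
print(specific_agreement(truth, prediction, 1))  # 4 / 6 ≈ 0.667
# Negative specific agreement = 2*TN / (2*TN + FN + FP), i.e. the opposing F1.
print(specific_agreement(truth, prediction, 0))  # 8 / 10 = 0.8
```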