Solved – Intuition about the F1 score

agreement-statistics, classification, philosophical, reliability

Let's say I have a dataset where half of the data points are labelled positive and half are labelled negative. My task is to create a classifier that recognises when a sample from the dataset is positive.

The most useless classifier I can come up with flips a fair coin for every sample to decide whether the sample is positive or negative. One way to quantify the performance of that classifier is the F1 score, which in expectation is 0.5 for a large dataset, since both precision and recall are expected to be 0.5. Moreover, the F1 score should in principle be concentrated around that value. That is, intuitively, the baseline F1 score I would compare everything against.
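A quick way to sanity-check this intuition is a small simulation. The following is a minimal sketch (my own illustration; the dataset size and seed are arbitrary choices) using numpy and scikit-learn's `f1_score`:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)  # arbitrary seed
n = 100_000                     # arbitrary dataset size

# Balanced ground truth: half positive (1), half negative (0).
y_true = np.repeat([1, 0], n // 2)

# Coin-flip classifier: predict positive with probability 0.5.
y_coin = rng.integers(0, 2, size=n)

# Expected to be close to 0.5, and concentrated for large n.
print(f1_score(y_true, y_coin))
```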

Now, instead of that useless classifier, I could create another classifier which, in my opinion, is just as useless: I simply declare every sample I receive to be positive. In that case my precision is still 0.5, but my recall is 1, which leads to an F1 score of 2/3. So if I use the F1 score to decide which classifier is better, I should pick this one over the random one.
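Continuing the hypothetical sketch above, the always-positive classifier indeed lands at about 2/3 on the same data:

```python
# Always-positive classifier on the same balanced ground truth.
y_all_pos = np.ones(n, dtype=int)

# Precision = 0.5 (half of all predicted positives are real) and
# recall = 1.0 (every real positive is caught), so
# F1 = 2 * 0.5 * 1.0 / (0.5 + 1.0) = 2/3.
print(f1_score(y_true, y_all_pos))  # ~0.6667
```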

This is perhaps more of a philosophical pondering, but in my opinion a criterion for selecting a classifier should not distinguish between these two classifiers (why should I prefer the second option?). I therefore wanted to ask about alternatives that do pass such a test. I'd prefer a single number that is a function of the true positive rate, false positive rate, true negative rate and false negative rate.

Best Answer

So this is a confusion matrix, in case any readers haven't seen one: $$ \begin{array}{l c c} & \text{Predict } + & \text{Predict } -\\ \text{Actual } + & a & b \\ \text{Actual } - & c & d \\ \end{array} $$
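To make the notation concrete, here is a minimal Python sketch (my own illustration, not part of the original answer) that tallies the four cells from binary label arrays; the names `a`, `b`, `c`, `d` match the table above:

```python
import numpy as np

def confusion_cells(y_true, y_pred):
    """Return (a, b, c, d) as laid out in the table above:
    a = true positives, b = false negatives,
    c = false positives, d = true negatives."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    a = np.sum(y_true & y_pred)
    b = np.sum(y_true & ~y_pred)
    c = np.sum(~y_true & y_pred)
    d = np.sum(~y_true & ~y_pred)
    return a, b, c, d
```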

And this is the formula for calculating $F_1$ from a confusion matrix: $$F_1 = \frac{2a}{2a+b+c}$$

So if half the items are positive in reality and all are predicted positive: $$\begin{array}{l c c} & \text{Predict } + & \text{Predict } -\\ \text{Actual } + & 50 & 0 \\ \text{Actual } - & 50 & 0 \\ \end{array}$$ then $F_1$ is equal to $2/3$, as you stated in your question: $$F_1 = \frac{2(50)}{2(50)+50+0} = 0.67$$

But if half the items are positive in reality and all the items are randomly predicted, then there are many different ways for this to occur, and the resulting $F_1$ score will range between $0.00$ and $1.00$. Imagine we have just two items; then there are four possible results for the random classifier: $$\begin{array}{l c c c c c} \text{Item} & \text{Reality} & \text{Predict}_1 & \text{Predict}_2 & \text{Predict}_3 & \text{Predict}_4 \\ \hline 1 & + & + & + & - & -\\ 2 & - & + & - & + & -\\ \hline F_1 & & 0.67 & 1.00 & 0.00 & 0.00\\ \end{array}$$

So all that was just to say that the value of $F_1$ for the random classifier is actually quite variable, unlike that of the always-positive classifier.

The alternative I'd suggest is the $S$ score first proposed by Bennett, Alpert, & Goldstein (1954). It assumes a 50% probability of classifying any given item into its correct class "by chance" alone and discounts this from the final score. It also uses all cells of the confusion matrix (unlike $F_1$, which ignores $d$, the number of "true negatives"): $$p_o = \frac{a+d}{a+b+c+d}$$ $$S = \frac{p_o - 0.5}{1-0.5}=2p_o-1$$

In the first example above, where $F_1=0.67$, the $S$ score is equal to $0.00$ and captures the idea that the predictions are doing no better than would be expected by chance: $$p_o = \frac{50+0}{50+0+50+0}=0.50$$ $$S = 2(0.50)-1 = 0.00$$

In the second example above, where $0.00 \le F_1 \le 1.00$ with a mean of $0.42$, the $S$ score actually ranges between $-1.00$ and $1.00$ with a mean of $0.00$. Thus, both classifiers would be deemed equal by the $S$ score: $$\begin{array}{l c c c c c} \text{Item} & \text{Reality} & \text{Predict}_1 & \text{Predict}_2 & \text{Predict}_3 & \text{Predict}_4 \\ \hline 1 & + & + & + & - & -\\ 2 & - & + & - & + & -\\ \hline F_1 & & 0.67 & 1.00 & 0.00 & 0.00\\ S & & 0.00 & 1.00 & -1.00 & 0.00 \\ \end{array}$$

You can find more information about classification reliability at my website, including a history of the $S$ score and functions for calculating it in MATLAB, and even generalizations of it to multiple classifiers, multiple categories, and non-nominal categories.
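For readers who want to experiment, here is a minimal Python sketch of the $S$ score as defined above (my own illustration; the functions linked from the answer's website are in MATLAB). The arguments follow the cell layout $a$, $b$, $c$, $d$ from the confusion matrix:

```python
def s_score(a, b, c, d):
    """Bennett, Alpert, & Goldstein's S: chance-corrected agreement,
    assuming a 50% chance of a correct classification."""
    p_o = (a + d) / (a + b + c + d)  # observed proportion correct
    return 2 * p_o - 1               # (p_o - 0.5) / (1 - 0.5)

# First example: 100 items, half positive, all predicted positive.
print(s_score(a=50, b=0, c=50, d=0))   # 0.0 -> no better than chance

# Four possible outcomes of the random classifier on two items
# (Predict_1 .. Predict_4 in the table above):
for cells in [(1, 0, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 0, 1)]:
    print(s_score(*cells))             # 0.0, 1.0, -1.0, 0.0
```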
