Solved – When is weighted average of $F_1$ scores $\simeq$ accuracy in classification

Tags: accuracy, classification, data-mining, machine-learning, model-evaluation

Example where accuracy $\simeq$ weighted average of $F_1$ scores

I have a classifier which classifies between 2 classes, $\mathbf{A}$ and $\mathbf{B}$. Say we have the confusion matrix below:

                  Predicted
                 A        B
             -------- --------
         A   |  103   |   5   
 Actual      -------- --------  
         B   |   3    |   97

The support for class $\mathbf{A}$ is $103 + 5 = 108$.
The support for class $\mathbf{B}$ is $3 + 97 = 100$.
The total number of data instances is $108 + 100 = 208$.

The accuracy can be calculated as follows: $ \frac{103\,+\,97}{208} = 0.961538462. $

The precision for class $\mathbf{A}$ can be calculated as $\frac{103}{103\,+\,3} = \frac{103}{106}$. Similarly, the precision for $\mathbf{B}$ is $\frac{97}{97\,+\,5} = \frac{97}{102}$.

The recall for $\mathbf{A}$ is $\frac{103}{103\,+\,5} = \frac{103}{108}$. The recall for $\mathbf{B}$ is $\frac{97}{97\,+\,3} = \frac{97}{100}$.

The $F_1$ score for class $\mathbf{A}$ is the harmonic mean of its precision and recall: $\left(\frac{1}{2}(\frac{106}{103} + \frac{108}{103})\right)^{-1} = \frac{103}{107}$.

The $F_1$ score for class $\mathbf{B}$ is the harmonic mean of its precision and recall: $\left(\frac{1}{2}(\frac{102}{97} + \frac{100}{97})\right)^{-1} = \frac{97}{101}$.

Finally, the weighted average of the $F_1$ scores where the weights are the support values: $\frac{\frac{103}{107} \cdot 108 + \frac{97}{101} \cdot 100}{208} = \frac{135089}{140491} = 0.96154913837$.

This value is very close to the accuracy; in fact, the two agree to four decimal places.
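For anyone who wants to reproduce the arithmetic above, here is a short script (Python with exact rational arithmetic; the language choice is purely for illustration) that recomputes every figure:

```python
from fractions import Fraction

# Confusion matrix counts from the example (rows: actual, columns: predicted)
a, b = 103, 5    # actual A: predicted A, predicted B
c, d = 3, 97     # actual B: predicted A, predicted B
n = a + b + c + d

accuracy = Fraction(a + d, n)

# Per-class precision, recall, and F1 (harmonic mean of precision and recall)
prec_A, rec_A = Fraction(a, a + c), Fraction(a, a + b)
prec_B, rec_B = Fraction(d, b + d), Fraction(d, c + d)
f1_A = 2 * prec_A * rec_A / (prec_A + rec_A)
f1_B = 2 * prec_B * rec_B / (prec_B + rec_B)

# Support-weighted mean F1 (weights are the class supports a+b and c+d)
swm_f1 = ((a + b) * f1_A + (c + d) * f1_B) / n

print(f1_A, f1_B)                       # 103/107 97/101
print(swm_f1)                           # 135089/140491
print(float(accuracy), float(swm_f1))   # ≈0.961538 ≈0.961549
```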

Classifier details

The classifier in question was evaluated as follows: the labeled data was split randomly into a training set and a test set, the classifier was trained on the training set, and its performance was then measured on the test set.

This was repeated for different random splits of the data into training and test sets. For every split, the accuracy and the support-weighted average of the $F_1$ scores are different, but they are always close: they fluctuate together in the range $[0.9, 0.97]$.

Question

What does it imply when the accuracy and the support-weighted average of $F_1$ scores are so similar in a 2-class classification? Are the two values always similar, or does this observation imply something about my data or classifier? If they are not always so similar, in what scenarios would they differ significantly?

Notes

I have checked this question, which has a somewhat similar title, but it's unrelated to what I'm asking.

Best Answer

Assessing the difference between a support-weighted mean $F1$ and accuracy

Example confusion matrix (rows: actual, columns: predicted):

                  Predicted
                 A        B
             -------- --------
         A   |   a    |   b
 Actual      -------- --------
         B   |   c    |   d

Class $A$'s $F1$

Using the classification outcomes $a$, $b$, $c$, $d$ as laid out in the confusion matrix above, the function for Class $A$'s $F1$ can be defined as: $$ F_{1;A} = \frac{2a}{(a+b)+(a+c)} $$

Class $B$'s $F1$

Similarly, the function for Class $B$'s $F1$ can be defined as: $$ F_{1;B} = \frac{2d}{(b+d)+(c+d)} $$

Support-weighted mean $F1$

Combining the $F1$s for Classes $A$ and $B$ into a support-weighted average gives: $$ Support\text{-}weighted\text{ }mean_{F1} = \frac{(a+b) \cdot \frac{2a}{(a+b)+(a+c)} + (c+d) \cdot \frac{2d}{(b+d)+(c+d)}}{a+b+c+d} $$ Writing $n = a+b+c+d$, the two $F1$ denominators are $(a+b)+(a+c) = n+(a-d)$ and $(b+d)+(c+d) = n-(a-d)$, so this can be rewritten as: $$ Support\text{-}weighted\text{ }mean_{F1} = \frac{1}{n}\left(\frac{2a(a+b)}{n+(a-d)} + \frac{2d(c+d)}{n-(a-d)}\right) $$
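The denominator rewriting — $(a+b)+(a+c) = n+(a-d)$ and $(b+d)+(c+d) = n-(a-d)$ with $n=a+b+c+d$ — is easy to confirm numerically. The sketch below (Python with exact rationals, used only as a convenient checker) compares the weighted average computed from the per-class $F1$s against the rewritten form on random integer matrices:

```python
import random
from fractions import Fraction

def swm_f1_direct(a, b, c, d):
    """Support-weighted mean F1 built from the per-class F1 definitions."""
    f1_A = Fraction(2 * a, (a + b) + (a + c))   # 2PR/(P+R) for class A
    f1_B = Fraction(2 * d, (b + d) + (c + d))   # 2PR/(P+R) for class B
    return ((a + b) * f1_A + (c + d) * f1_B) / (a + b + c + d)

def swm_f1_rewritten(a, b, c, d):
    """Same quantity with the denominators written as n +/- (a - d)."""
    n = a + b + c + d
    return (Fraction(2 * a * (a + b), n + (a - d))
            + Fraction(2 * d * (c + d), n - (a - d))) / n

random.seed(0)
for _ in range(1000):
    a, b, c, d = (random.randint(1, 100) for _ in range(4))
    assert swm_f1_direct(a, b, c, d) == swm_f1_rewritten(a, b, c, d)
```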

Classification Accuracy

Finally, in the same regard, the function for the classification accuracy can be defined as: $$ Accuracy = \frac{a+d}{a+b+c+d} $$

Support-weighted mean $F1$ vs. Classification Accuracy - A Distinguishing Component

By subtracting the accuracy from the support-weighted mean $F1$, we can figure out which conditions involving $a$, $b$, $c$, $d$ determine similarity between the two statistics: $$ Support\text{-}weighted\text{ }mean_{F1} - Accuracy = \frac{1}{n}\left(\frac{2a(a+b)}{n+(a-d)} - a\right) + \frac{1}{n}\left(\frac{2d(c+d)}{n-(a-d)} - d\right) $$ Since $2(a+b) - \big(n+(a-d)\big) = b-c$ and $2(c+d) - \big(n-(a-d)\big) = c-b$, this reduces to $$ \frac{1}{n}\left(\frac{a(b-c)}{n+(a-d)} - \frac{d(b-c)}{n-(a-d)}\right), $$ and combining the two terms over a common denominator, using $a\big(n-(a-d)\big) - d\big(n+(a-d)\big) = (a-d)(n-a-d) = (a-d)(b+c)$, gives: $$ Support\text{-}weighted\text{ }mean_{F1} - Accuracy = \frac{(a-d)(b-c)(b+c)}{n\left(n^2-(a-d)^2\right)} $$

Therefore, the two statistics are identical exactly when $a=d$ (the correct counts are balanced) or $b=c$ (the errors are symmetric, which includes the error-free case $b=c=0$), and their difference is small whenever $\frac{(a-d)(b-c)(b+c)}{n\left(n^2-(a-d)^2\right)}$ (let's just call this the '$SWM\text{ }F1$ component') is close to zero.
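As an independent check, the gap between the two statistics can be computed directly and compared with the closed form $\frac{(a-d)(b-c)(b+c)}{n(n^2-(a-d)^2)}$; the sketch below (Python with exact rationals, purely as a checker) does this on random matrices and then evaluates the gap for the matrix in the question:

```python
import random
from fractions import Fraction

def stats_gap(a, b, c, d):
    """Support-weighted mean F1 minus accuracy, in exact arithmetic."""
    n = a + b + c + d
    f1_A = Fraction(2 * a, (a + b) + (a + c))
    f1_B = Fraction(2 * d, (b + d) + (c + d))
    swm = ((a + b) * f1_A + (c + d) * f1_B) / n
    acc = Fraction(a + d, n)
    return swm - acc

def component(a, b, c, d):
    """The closed form (a-d)(b-c)(b+c) / (n (n^2 - (a-d)^2))."""
    n = a + b + c + d
    return Fraction((a - d) * (b - c) * (b + c), n * (n * n - (a - d) ** 2))

random.seed(1)
for _ in range(1000):
    a, b, c, d = (random.randint(1, 100) for _ in range(4))
    assert stats_gap(a, b, c, d) == component(a, b, c, d)

# The matrix from the question: a tiny gap of roughly 1e-5.
print(float(stats_gap(103, 5, 3, 97)))   # ≈ 1.07e-05
```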

Simulating the Component

Using R, I simulated 2000 confusion matrices (with $a, b, c, d \sim Uniform(0,100)$) and created the following plot to demonstrate the influence of this $SWM\text{ }F1$ component on the difference between the two statistics:

[simulation plot: absolute difference between the two statistics against the $SWM\text{ }F1$ component]

The plot shows that small $SWM\text{ }F1$ components go hand in hand with heightened similarity between the two statistics.
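The original R code is not shown; a rough Python re-creation of that simulation (same 2000 draws from $Uniform(0,100)$), recording for each random matrix the component $\frac{(a-d)(b-c)(b+c)}{n(n^2-(a-d)^2)}$ alongside the absolute gap between the two statistics, might look like:

```python
import random

def simulate(n_trials=2000, seed=42):
    """Draw random confusion matrices; return (component, |SWM F1 - accuracy|) pairs."""
    random.seed(seed)
    pairs = []
    for _ in range(n_trials):
        a, b, c, d = (random.uniform(0, 100) for _ in range(4))
        n = a + b + c + d
        f1_A = 2 * a / ((a + b) + (a + c))
        f1_B = 2 * d / ((b + d) + (c + d))
        swm = ((a + b) * f1_A + (c + d) * f1_B) / n
        acc = (a + d) / n
        # Component: (a-d)(b-c)(b+c) / (n (n^2 - (a-d)^2))
        comp = (a - d) * (b - c) * (b + c) / (n * (n * n - (a - d) ** 2))
        pairs.append((comp, abs(swm - acc)))
    return pairs

pairs = simulate()
# Plotting |gap| against the component (e.g. with matplotlib) reproduces
# the pattern described above: small components, near-identical statistics.
```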

...and back to your question

To summarize, your support-weighted mean $F1$ and accuracy are similar because the $SWM\text{ }F1$ component (i.e. $\frac{(a-d)(b-c)(b+c)}{n(n^2-(a-d)^2)}$) is tiny for your matrix: your false predictions are few and nearly symmetric ($b = 5$, $c = 3$), and your correct counts are nearly balanced ($a = 103$, $d = 97$).

It is also worth noting (as can be read off the component's formula) that an accurate classifier on a roughly balanced test set can never produce a large difference between the two statistics, since the numerator vanishes as the error counts ($b$, $c$) and the imbalance ($a-d$) shrink relative to $n$.
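Conversely, to see a scenario where the two statistics visibly disagree, it is enough to make both the classes and the errors imbalanced. The counts below are hypothetical, chosen only to illustrate:

```python
from fractions import Fraction

# Hypothetical, deliberately skewed confusion matrix:
# class A dominates (a = 90 of 100) and nearly all errors are
# B-misclassified-as-A (c = 9, b = 0), so a - d and b - c are both large.
a, b, c, d = 90, 0, 9, 1
n = a + b + c + d

f1_A = Fraction(2 * a, (a + b) + (a + c))   # = 20/21 ≈ 0.952
f1_B = Fraction(2 * d, (b + d) + (c + d))   # = 2/11  ≈ 0.182
swm_f1 = ((a + b) * f1_A + (c + d) * f1_B) / n
accuracy = Fraction(a + d, n)

print(float(accuracy))   # 0.91
print(float(swm_f1))     # ≈ 0.875 — noticeably below the accuracy
```

Here the gap is about $0.035$, three orders of magnitude larger than for the question's matrix, precisely because $a-d$ and $b-c$ are both far from zero.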
