Solved – F1-Score in a multilabel classification paper: is macro, weighted or micro F1 used?

Tags: classification, f1, multilabel, scikit-learn

I read this paper on a multilabel classification task. The authors evaluate their models on the F1 score, but they do not mention whether this is the macro, micro, or weighted F1 score.
They only mention:

We chose F1 score as the metric for evaluating our multi-label classification system's performance. F1 score is the harmonic mean of precision (the fraction of returned results that are correct) and recall (the fraction of correct results that are returned).

From that, can I guess which F1-Score I should use to reproduce their results with scikit-learn? Or is it obvious which one is used by convention?

Edit:
I am not sure why this question is marked as off-topic or what would make it on-topic, so I will try to clarify my question and would be grateful for pointers on how and where to ask it.

As I understand it, the difference between the three F1-score calculations is the following:

  • macro calculates the F1 score for each label and takes their unweighted mean, so every label gets the same weight: $f1_{macro} = \frac{1}{N}\sum_{n=1}^{N} f1_n$
  • weighted calculates the F1 score for each label and sums them weighted by each label's support $w_n$ (its share of the true instances): $f1_{weighted} = \sum_{n=1}^{N} w_n \, f1_n$
  • micro calculates a single F1 score by computing precision and recall from the total true positives, false positives, and false negatives pooled over all labels (see the scikit-learn sketch after this list).
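For concreteness, here is a minimal scikit-learn sketch of the three variants; the label matrices below are made up for illustration and are not taken from the paper:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel indicator matrices (rows = samples, columns = labels);
# made-up values purely for illustration.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [1, 0, 0]])

# macro: unweighted mean of the per-label F1 scores
print("macro   :", f1_score(y_true, y_pred, average="macro"))
# weighted: per-label F1 scores weighted by each label's support
print("weighted:", f1_score(y_true, y_pred, average="weighted"))
# micro: F1 from the global TP/FP/FN counts pooled over all labels
print("micro   :", f1_score(y_true, y_pred, average="micro"))
```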

The text in the paper seems to indicate that the micro F1 score is used, because nothing else is mentioned. Is it safe to assume so?

Best Answer

I thought the "macro" in macro F1 refers to averaging the precision and the recall rather than the F1 scores themselves. We can calculate the precision for each label and take the unweighted mean to get the macro precision; by the same token, we take the unweighted mean of the per-label recalls to get the macro recall. Once we have the macro precision and macro recall, we can obtain the macro F1 (please refer to here for more information). Micro F1 works the same way, except that precision and recall are calculated globally by counting the total true positives, false negatives and false positives.
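To make that distinction concrete (my own sketch, reusing the toy matrices from the question, not code from the paper): the harmonic mean of macro precision and macro recall is not necessarily the same number as scikit-learn's `f1_score(average="macro")`, which instead takes the unweighted mean of the per-label F1 scores.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Same made-up multilabel data as in the question's sketch.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [1, 0, 0]])

# Macro precision and macro recall: unweighted means over the labels.
p_macro = precision_score(y_true, y_pred, average="macro")
r_macro = recall_score(y_true, y_pred, average="macro")

# Macro F1 as described above: harmonic mean of macro precision and recall.
f1_from_macro_pr = 2 * p_macro * r_macro / (p_macro + r_macro)

# scikit-learn's macro F1: unweighted mean of the per-label F1 scores.
f1_sklearn_macro = f1_score(y_true, y_pred, average="macro")

print(f1_from_macro_pr, f1_sklearn_macro)  # the two need not be equal
```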

Macro F1 weighs each class equally, while micro F1 weighs each sample equally. In this case, the F1 most probably defaults to macro F1, since it is hard to make every tag equally frequent, and the resulting class imbalance would skew a micro F1 (all tags would most probably not occur in equal amounts).
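A made-up illustration of how imbalance pulls the two apart (not the paper's tags): if a very common tag is predicted well but a rare tag is always missed, micro F1 tracks the common tag while macro F1 gives the rare tag equal weight.

```python
import numpy as np
from sklearn.metrics import f1_score

# Column 0: a common tag that is predicted perfectly;
# column 1: a rare tag that the classifier never predicts.
y_true = np.array([[1, 0]] * 98 + [[0, 1]] * 2)
y_pred = np.array([[1, 0]] * 98 + [[0, 0]] * 2)

# zero_division=0 silences the warning for the never-predicted rare tag.
print("micro:", f1_score(y_true, y_pred, average="micro", zero_division=0))  # ~0.99
print("macro:", f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.5
```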

Let's come back to the paper; we can probably get some more hints from this snippet:

To debug our multi-label classification system, we examined which of the 20 most common tags had the worst performing classifiers (lowest F1 scores).

A macro F1 also makes this kind of error analysis easier, because it is built directly from the per-label scores.
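If one wanted to reproduce that kind of debugging with scikit-learn (the tag names and data below are hypothetical, not the paper's), `average=None` returns one F1 score per label, which can then be sorted to find the weakest classifiers:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical tags for the columns of the toy matrices from the question.
tags = ["tag_a", "tag_b", "tag_c"]
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [1, 0, 0]])

# average=None gives one F1 score per label instead of a single aggregate.
per_tag_f1 = f1_score(y_true, y_pred, average=None)

# Sort tags from worst to best F1 to spot the worst-performing classifiers.
for tag, score in sorted(zip(tags, per_tag_f1), key=lambda pair: pair[1]):
    print(f"{tag}: {score:.3f}")
```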
