Solved – Best way to average F-score with unbalanced classes

machine-learning, mean, scikit-learn, unbalanced-classes

I have a dataset with unbalanced classes: three classes make up about 60% of the data. In addition, my train/test splits introduce a further imbalance. For example:

Train set:
label_1 … label_n

Test set:
label_1, label_3, label_9

This means that even though only 3 labels appear in my test set, each test instance could still be predicted as any one of the n labels. So when I use sklearn.metrics.precision_recall_fscore_support, the per-class results contain a lot of zeros.
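
A minimal sketch of what I mean (the label values and predictions below are made up for illustration):

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    all_labels = list(range(10))      # suppose there are n = 10 possible classes
    y_true = [1, 1, 3, 3, 9, 9]       # the test set only contains labels 1, 3 and 9
    y_pred = [1, 3, 3, 3, 9, 5]       # but predictions may use any of the 10 labels

    # Passing labels=all_labels forces one entry per class, so every class that
    # never occurs in the test set gets precision = recall = F = 0 and support = 0.
    p, r, f, support = precision_recall_fscore_support(
        y_true, y_pred, labels=all_labels, zero_division=0
    )
    print(np.round(f, 2))   # mostly zeros: only labels 1, 3 and 9 can be non-zero
    print(support)          # 0 for the seven classes absent from the test set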

My problem is that I need a single average F-score across all classes, rather than a per-class value. However, simply averaging the values returned by the sklearn function above will always give a very low number, since there are so many zeros. On the other hand, averaging only the non-zero values does not make sense to me either, since the set of labels a prediction can take is still the full set of classes.

Is there a good way to take an average in this case? I've tried the micro, macro, and weighted average options, but I am not sure which one is right.
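
Roughly, this is what I have been trying (same made-up data as above); the labels argument controls which classes enter the average:

    from sklearn.metrics import f1_score

    y_true = [1, 1, 3, 3, 9, 9]
    y_pred = [1, 3, 3, 3, 9, 5]
    all_labels = list(range(10))

    for avg in ("micro", "macro", "weighted"):
        f_all = f1_score(y_true, y_pred, labels=all_labels, average=avg, zero_division=0)
        f_seen = f1_score(y_true, y_pred, labels=[1, 3, 9], average=avg, zero_division=0)
        # macro over all 10 labels is dragged down by the zero-score classes;
        # weighted is not, because the absent classes have zero support
        print(avg, round(f_all, 3), round(f_seen, 3))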

Could anyone please help me with this?

Best Answer

"I am not sure which one is right"

There is no right or wrong here.

A classifier's performance can be fully represented by an $n \times n$ confusion matrix. When you try to summarize that performance with a single metric, you lose some information.

In other words, since the confusion matrix cannot be reconstructed from a single number, reducing a classifier's performance to one metric necessarily discards information.
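
As a small (made-up) illustration: the two classifiers below have identical accuracy but very different confusion matrices and macro F-scores, so the single number alone cannot tell them apart, let alone recover the matrix.

    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    y_true = [0] * 90 + [1] * 10          # imbalanced: 90 vs 10

    # Classifier A: misses 5 instances of each class (10 errors in total).
    y_pred_a = [0] * 85 + [1] * 5 + [1] * 5 + [0] * 5
    # Classifier B: predicts the majority class everywhere (also 10 errors).
    y_pred_b = [0] * 100

    for name, y_pred in (("A", y_pred_a), ("B", y_pred_b)):
        acc = accuracy_score(y_true, y_pred)                      # 0.90 for both
        f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
        print(name, "accuracy:", acc, "macro F1:", round(f1, 3))  # A ~0.72, B ~0.47
        print(confusion_matrix(y_true, y_pred))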

But still: to decide which classifier is better among several alternatives, we need a single metric.

Which single metric best represents the performance? That is a subjective question. This is where statisticians get creative, and it is why so many metrics have been proposed.

Different metrics 'prefer' different types of information that can be extracted from the confusion matrix. It is up to you to decide which one captures the information you regard as 'most important'.

Some criteria you may consider:

  • Are all classes equally important, or are all instances equally important? (See the sketch after this list.)
  • Are correct classifications and misclassifications equally 'important'?
  • Are false positives and false negatives equally 'important'?
  • Should the performance be absolute, or relative to some random classifier?
  • Should the metric be linear in some sense?
  • etc.
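
For instance, the first criterion maps directly onto the averaging options mentioned in the question. A rough sketch with made-up predictions on an imbalanced test set:

    from sklearn.metrics import f1_score

    y_true = [0] * 95 + [1] * 5          # heavily imbalanced test set
    y_pred = [0] * 95 + [0] * 4 + [1]    # the minority class is almost always missed

    # macro: every class counts equally, so the poor minority class pulls it down
    print("macro   :", round(f1_score(y_true, y_pred, average="macro"), 3))
    # micro: every instance counts equally (equals accuracy for single-label data)
    print("micro   :", round(f1_score(y_true, y_pred, average="micro"), 3))
    # weighted: per-class F1 weighted by class support, so the majority class dominates
    print("weighted:", round(f1_score(y_true, y_pred, average="weighted"), 3))

Loosely speaking, macro averaging reflects the "all classes are equally important" view, while micro and weighted averaging reflect the "all instances are equally important" view; which of those views matters more depends on your application, not on the mathematics.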