Solved – Summarising Precision/Recall Measures in Multi-class Problem


I have a hierarchical multi-class classification system that classifies records into about 500 different categories. I want to summarise the performance of the classifier in a simple way.

A measure of accuracy on validation data is easy to implement: correctly coded/all coded. For each class, we can look at binary measures of precision and recall to summarise the performance relative to that class.
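To make these two quantities concrete, here is a minimal sketch in plain Python. The function names and the toy labels are made up for illustration and are not taken from any particular library. Each class is treated one-vs-rest, and a class that is never predicted is given a precision of 0.0 by convention, which is an assumption on my part.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of records coded correctly: correctly coded / all coded."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall, support)} using one-vs-rest counts."""
    actual = Counter(y_true)       # how often each label really occurs
    predicted = Counter(y_pred)    # how often each label was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        tp = true_pos[label]
        # Assumed convention: an undefined ratio (zero denominator) is 0.0.
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall, actual[label])
    return scores

# Toy validation data (hypothetical labels).
y_true = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "a"]

acc = accuracy(y_true, y_pred)
scores = per_class_precision_recall(y_true, y_pred)
for label, (precision, recall, support) in scores.items():
    print(f"{label}: precision={precision:.3f} recall={recall:.3f} n={support}")
```

With 500 classes this gives 500 pairs of numbers, which is exactly why some summary across classes is needed.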

However, there doesn't seem to be a generally accepted way to combine the per-class precision and recall values into summaries of precision and recall across the entire set of classes. There appear to be a few ways to approach this summary:

  1. Take a simple average (arithmetic/geometric/harmonic) of each class's precision/recall.

  2. Take a weighted average (weighted by number of examples, etc.) of each class's precision/recall.

  3. Use bookmaker's informedness/markedness, which seem to have a natural generalisation in the multiclass context.
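As a rough illustration of how much options 1 and 2 can differ, here is a small Python sketch using arithmetic means. The per-class numbers and class names are invented, and the helper names are hypothetical. It assumes each record has exactly one true class and one predicted class.

```python
# Hypothetical per-class results: label -> (precision, recall, support),
# where support is the number of validation examples truly in that class.
per_class = {
    "common": (0.90, 0.95, 900),
    "medium": (0.60, 0.50, 90),
    "rare":   (0.20, 0.10, 10),
}

def simple_average(values):
    """Unweighted arithmetic mean: every class counts equally."""
    return sum(values) / len(values)

def weighted_average(values, weights):
    """Arithmetic mean with each class weighted (here, by its support)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

precisions = [p for p, _, _ in per_class.values()]
recalls = [r for _, r, _ in per_class.values()]
supports = [n for _, _, n in per_class.values()]

simple_recall = simple_average(recalls)                # pulled down by "rare"
weighted_recall = weighted_average(recalls, supports)  # dominated by "common"
simple_precision = simple_average(precisions)
weighted_precision = weighted_average(precisions, supports)

print(f"simple recall:   {simple_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

One thing worth noticing: when the weights are the true class counts, the weighted average of recall is the same number as overall accuracy, since each class contributes (class count / total) × (correct in class / class count). So that particular summary adds no information beyond accuracy, whereas the simple average reflects how the rare classes are doing.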

Are there particular advantages to any one of these approaches? Is there a generally accepted way to do this that I've just been missing?

Best Answer

As far as I know, there isn't a de facto standard way of calculating precision and recall for multi-class classification.

Your approaches are what I too would try:

  1. Class-wise harmonic mean.
  2. Class-wise weighted harmonic mean (if the classes are imbalanced), with each class weighted by its share of the data (i.e. class weight = number of class examples / number of total examples).
  3. Class-wise geometric mean (another approach if the classes are imbalanced).
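The three means above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names and example numbers are made up, and a class with a score of 0 is assumed to force the harmonic and geometric means to 0, which is the usual limiting behaviour.

```python
import math

def harmonic_mean(values):
    """Unweighted harmonic mean; any zero score drags the result to zero."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

def weighted_harmonic_mean(values, weights):
    """Harmonic mean with per-class weights (e.g. class share of the data)."""
    pairs = [(v, w) for v, w in zip(values, weights) if w > 0]
    if any(v == 0 for v, _ in pairs):
        return 0.0
    return sum(w for _, w in pairs) / sum(w / v for v, w in pairs)

def geometric_mean(values):
    """Geometric mean, computed via logs for numerical stability."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical class-wise recall values and class sizes.
recalls = [0.8, 0.5, 0.2]
class_counts = [70, 20, 10]
# class weight = number of class examples / number of total examples
weights = [n / sum(class_counts) for n in class_counts]

hm = harmonic_mean(recalls)
whm = weighted_harmonic_mean(recalls, weights)
gm = geometric_mean(recalls)
print(f"harmonic={hm:.3f} weighted harmonic={whm:.3f} geometric={gm:.3f}")
```

For the same set of scores, the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean, so the harmonic mean is the harshest on classes that perform badly. Weighting by class share moves the summary back towards the large classes.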

There are also other metrics to evaluate the performance of your model, besides precision and recall: