Evaluation Measures – Should Decisions Be Based on Micro-Averaged or Macro-Averaged Measures?

Tags: cross-validation, machine-learning

I ran 10-fold cross-validation on different binary classification algorithms, with the same dataset, and obtained both micro- and macro-averaged results. It should be mentioned that this was a multi-label classification problem.
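For reference, here is a minimal sketch of how such fold-wise micro- and macro-averaged scores can be obtained. The scikit-learn setup and the synthetic multi-label data are assumptions for illustration, not my actual experiment:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

# Synthetic multi-label data standing in for the real dataset.
X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=10, random_state=0)

# MLPClassifier handles multi-label indicator targets natively.
clf = MLPClassifier(max_iter=500, random_state=0)

# 10-fold CV, scoring each fold under both averaging schemes.
scores = cross_validate(clf, X, Y, cv=10,
                        scoring={"f1_micro": "f1_micro",
                                 "f1_macro": "f1_macro"})

print("micro-averaged F1: %.3f" % scores["test_f1_micro"].mean())
print("macro-averaged F1: %.3f" % scores["test_f1_macro"].mean())
```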

In my case, true negatives and true positives are weighted equally. That means correctly predicting true negatives is just as important as correctly predicting true positives.

The micro-averaged measures are lower than the macro-averaged ones. Here are the results for a neural network and a support vector machine:

[Table: micro- and macro-averaged results for the neural network and the SVM]

I also ran a percentage-split test on the same dataset with another algorithm. The results were:

[Table: percentage-split results for the other algorithm]

I would prefer to compare the percentage-split results with the macro-averaged ones, but is that fair? I don't believe the macro-averaged results are biased, since true positives and true negatives are weighted equally, but then again I wonder whether this is comparing apples with oranges.

UPDATE

Based on the comments, I will show how the micro and macro averages are calculated.

I have 144 labels (i.e., 144 binary target attributes) that I want to predict. Precision, Recall, and F-Measure are calculated for each label.

---------------------------------------------------
LABEL1 | LABEL2 | LABEL3 | LABEL4 | .. | LABEL144
---------------------------------------------------
   ?   |    ?   |    ?   |   ?    | .. |     ?
---------------------------------------------------

Consider a binary evaluation measure B(tp, tn, fp, fn) that is calculated from the true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). The macro and micro averages of a specific measure can then be calculated as follows:

$$B_{\text{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} B(tp_\lambda, tn_\lambda, fp_\lambda, fn_\lambda)$$

$$B_{\text{micro}} = B\!\left(\sum_{\lambda=1}^{q} tp_\lambda,\; \sum_{\lambda=1}^{q} tn_\lambda,\; \sum_{\lambda=1}^{q} fp_\lambda,\; \sum_{\lambda=1}^{q} fn_\lambda\right)$$

where $q$ is the number of labels (here, 144).

Using these formulas, the micro and macro averages of, for example, Precision and Recall become:

$$\text{Precision}_{\text{micro}} = \frac{\sum_{\lambda=1}^{q} tp_\lambda}{\sum_{\lambda=1}^{q} tp_\lambda + \sum_{\lambda=1}^{q} fp_\lambda}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_{\lambda=1}^{q} tp_\lambda}{\sum_{\lambda=1}^{q} tp_\lambda + \sum_{\lambda=1}^{q} fn_\lambda}$$

$$\text{Precision}_{\text{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} \frac{tp_\lambda}{tp_\lambda + fp_\lambda}, \qquad \text{Recall}_{\text{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} \frac{tp_\lambda}{tp_\lambda + fn_\lambda}$$

So, micro-averaged measures sum the tp, tn, fp, and fn over all labels and then compute a single binary evaluation on the pooled counts. Macro-averaged measures compute the measure (Precision, Recall, or F-Measure) per label, sum these values, and divide by the number of labels, which is closer to an ordinary average.
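To make the difference concrete, here is a small sketch of the two averaging schemes with invented per-label counts for three labels (the numbers are illustrative only, not my data):

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# (tp, fp, fn) per label -- made-up counts for three labels.
counts = [(50, 10, 5), (5, 5, 10), (2, 8, 4)]

# Macro: compute the measure per label, then average the measures.
macro_p = sum(precision(tp, fp) for tp, fp, _ in counts) / len(counts)
macro_r = sum(recall(tp, fn) for tp, _, fn in counts) / len(counts)

# Micro: pool the counts over all labels, then compute the measure once.
TP = sum(tp for tp, _, _ in counts)
FP = sum(fp for _, fp, _ in counts)
FN = sum(fn for _, _, fn in counts)
micro_p = precision(TP, FP)
micro_r = recall(TP, FN)

print(f"macro: precision={macro_p:.3f}, recall={macro_r:.3f}")
print(f"micro: precision={micro_p:.3f}, recall={micro_r:.3f}")
```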

Now, the question is which one to use?

Best Answer

If you think all the labels are more or less equally sized (have roughly the same number of instances), use either.

If you think there are labels with more instances than others and you want to bias your metric towards the most populated ones, use the micro-average.

If you think there are labels with more instances than others and you want to bias your metric towards the least populated ones (or at least not towards the most populated ones), use the macro-average.

If the micro-average is significantly lower than the macro-average, it means you have some gross misclassification in the most populated labels, whereas your smaller labels are probably correctly classified. If the macro-average is significantly lower than the micro-average, it means your smaller labels are poorly classified, whereas your larger ones are probably correctly classified.
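A toy example of this diagnostic, with invented counts for one large and one small label (a minimal sketch, not real data):

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Large label: 1000 positives, half of them missed, many false alarms.
big = (500, 500, 500)
# Small label: 10 positives, perfectly classified.
small = (10, 0, 0)

# Macro averages the per-label F1 scores; micro pools the counts first.
macro = (f1(*big) + f1(*small)) / 2
micro = f1(big[0] + small[0], big[1] + small[1], big[2] + small[2])

print(f"macro F1 = {macro:.3f}")  # ~0.75 -- looks acceptable
print(f"micro F1 = {micro:.3f}")  # ~0.50 -- reveals the large-label problem
```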

If you're not sure what to do, carry on the comparisons with both the micro- and the macro-average :)

This is a good paper on the subject.