Evaluation Measures – Should Decisions Be Based on Micro-Averaged or Macro-Averaged Measures?

Tags: cross-validation, machine-learning

I ran 10-fold cross-validation on different binary classification algorithms, with the same dataset, and obtained both micro- and macro-averaged results. It should be mentioned that this was a multi-label classification problem.
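For reference, here is a minimal sketch of how such fold-wise micro- and macro-averaged scores can be obtained. The scikit-learn setup and the synthetic multi-label data are assumptions for illustration, not my actual experiment:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier

# Synthetic multi-label data standing in for the real dataset.
X, Y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=10, random_state=0)

# MLPClassifier handles multi-label indicator targets natively.
clf = MLPClassifier(max_iter=500, random_state=0)

# 10-fold CV, scoring each fold under both averaging schemes.
scores = cross_validate(clf, X, Y, cv=10,
                        scoring={"f1_micro": "f1_micro",
                                 "f1_macro": "f1_macro"})

print("micro-averaged F1: %.3f" % scores["test_f1_micro"].mean())
print("macro-averaged F1: %.3f" % scores["test_f1_macro"].mean())
```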

In my case, true negatives and true positives are weighted equally. That means correctly predicting true negatives is just as important as correctly predicting true positives.

The micro-averaged measures are lower than the macro-averaged ones. Here are the results for a neural network and a support vector machine:

[Table: micro- and macro-averaged results for the neural network and the SVM]

I also ran a percentage-split test on the same dataset with another algorithm. The results were:

[Table: percentage-split results for the other algorithm]

I would prefer to compare the percentage-split results with the macro-averaged ones, but is that fair? I don't believe the macro-averaged results are biased, since true positives and true negatives are weighted equally, but then again I wonder whether this is comparing apples with oranges.

UPDATE

Based on the comments, I will show how the micro and macro averages are calculated.

I have 144 labels (i.e., 144 binary target attributes) that I want to predict. Precision, Recall, and F-Measure are calculated for each label.

---------------------------------------------------
LABEL1 | LABEL2 | LABEL3 | LABEL4 | .. | LABEL144
---------------------------------------------------
   ?   |    ?   |    ?   |   ?    | .. |     ?
---------------------------------------------------

Consider a binary evaluation measure B(tp, tn, fp, fn) that is calculated from the true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn). The macro and micro averages of a specific measure can then be calculated as follows:

$$B_{\text{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} B(tp_\lambda, tn_\lambda, fp_\lambda, fn_\lambda)$$

$$B_{\text{micro}} = B\!\left(\sum_{\lambda=1}^{q} tp_\lambda,\; \sum_{\lambda=1}^{q} tn_\lambda,\; \sum_{\lambda=1}^{q} fp_\lambda,\; \sum_{\lambda=1}^{q} fn_\lambda\right)$$

where $q$ is the number of labels (here, 144).

Using these formulas, the micro and macro averages of, for example, Precision and Recall become:

$$\text{Precision}_{\text{micro}} = \frac{\sum_{\lambda=1}^{q} tp_\lambda}{\sum_{\lambda=1}^{q} tp_\lambda + \sum_{\lambda=1}^{q} fp_\lambda}, \qquad \text{Recall}_{\text{micro}} = \frac{\sum_{\lambda=1}^{q} tp_\lambda}{\sum_{\lambda=1}^{q} tp_\lambda + \sum_{\lambda=1}^{q} fn_\lambda}$$

$$\text{Precision}_{\text{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} \frac{tp_\lambda}{tp_\lambda + fp_\lambda}, \qquad \text{Recall}_{\text{macro}} = \frac{1}{q} \sum_{\lambda=1}^{q} \frac{tp_\lambda}{tp_\lambda + fn_\lambda}$$

So, micro-averaged measures sum the tp, tn, fp, and fn over all labels and then compute a single binary evaluation on the pooled counts. Macro-averaged measures compute the measure (Precision, Recall, or F-Measure) per label, sum these values, and divide by the number of labels, which is closer to an ordinary average.
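To make the difference concrete, here is a small sketch of the two averaging schemes with invented per-label counts for three labels (the numbers are illustrative only, not my data):

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# (tp, fp, fn) per label -- made-up counts for three labels.
counts = [(50, 10, 5), (5, 5, 10), (2, 8, 4)]

# Macro: compute the measure per label, then average the measures.
macro_p = sum(precision(tp, fp) for tp, fp, _ in counts) / len(counts)
macro_r = sum(recall(tp, fn) for tp, _, fn in counts) / len(counts)

# Micro: pool the counts over all labels, then compute the measure once.
TP = sum(tp for tp, _, _ in counts)
FP = sum(fp for _, fp, _ in counts)
FN = sum(fn for _, _, fn in counts)
micro_p = precision(TP, FP)
micro_r = recall(TP, FN)

print(f"macro: precision={macro_p:.3f}, recall={macro_r:.3f}")
print(f"micro: precision={micro_p:.3f}, recall={micro_r:.3f}")
```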

Now, the question is which one to use?

Best Answer

If you think all the labels are more or less equally sized (have roughly the same number of instances), use either.

If you think there are labels with more instances than others and you want to bias your metric towards the most populated ones, use the micro-average.

If you think there are labels with more instances than others and you want to bias your metric towards the least populated ones (or at least not towards the most populated ones), use the macro-average.

If the micro-average is significantly lower than the macro-average, it means you have some gross misclassification in the most populated labels, whereas your smaller labels are probably correctly classified. If the macro-average is significantly lower than the micro-average, it means your smaller labels are poorly classified, whereas your larger ones are probably correctly classified.
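A toy example of this diagnostic, with invented counts for one large and one small label (a minimal sketch, not real data):

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Large label: 1000 positives, half of them missed, many false alarms.
big = (500, 500, 500)
# Small label: 10 positives, perfectly classified.
small = (10, 0, 0)

# Macro averages the per-label F1 scores; micro pools the counts first.
macro = (f1(*big) + f1(*small)) / 2
micro = f1(big[0] + small[0], big[1] + small[1], big[2] + small[2])

print(f"macro F1 = {macro:.3f}")  # ~0.75 -- looks acceptable
print(f"micro F1 = {micro:.3f}")  # ~0.50 -- reveals the large-label problem
```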

If you're not sure what to do, carry on the comparisons with both the micro- and the macro-average :)

This is a good paper on the subject.