I'm wondering how to calculate precision and recall measures for multiclass multilabel classification, i.e. classification where there are more than two labels, and where each instance can have multiple labels?
Solved – How to compute precision/recall for multiclass-multilabel classification
classification, machine-learning, multi-class, precision-recall
Related Solutions
In a 2-hypothesis case, the confusion matrix is usually:
| | Declare H1 | Declare H0 |
|---|---|---|
| Is H1 | TP | FN |
| Is H0 | FP | TN |
where I've used something similar to your notation:
- TP = true positive (declare H1 when, in truth, H1),
- FN = false negative (declare H0 when, in truth, H1),
- FP = false positive (declare H1 when, in truth, H0),
- TN = true negative (declare H0 when, in truth, H0)
From the raw data, the values in the table would typically be the counts for each occurrence over the test data. From this, you should be able to compute the quantities you need.
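As a minimal sketch of that counting step (plain Python with made-up truth/declaration pairs; the variable names are just for illustration):

```python
from collections import Counter

# hypothetical truth / declaration pairs over a test set
truth    = ["H1", "H1", "H0", "H1", "H0", "H0"]
declared = ["H1", "H0", "H0", "H1", "H1", "H0"]

# count each (truth, declared) combination over the test data
counts = Counter(zip(truth, declared))
TP = counts[("H1", "H1")]  # declare H1 when, in truth, H1
FN = counts[("H1", "H0")]  # declare H0 when, in truth, H1
FP = counts[("H0", "H1")]  # declare H1 when, in truth, H0
TN = counts[("H0", "H0")]  # declare H0 when, in truth, H0
print(TP, FN, FP, TN)      # 2 1 1 2
```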
Edit
The generalization to multi-class problems is to sum over rows / columns of the confusion matrix. Given that the matrix is oriented as above, i.e., that a given row of the matrix corresponds to a specific value of the "truth", we have:
$\text{Precision}_i = \cfrac{M_{ii}}{\sum_j M_{ji}}$
$\text{Recall}_i = \cfrac{M_{ii}}{\sum_j M_{ij}}$
That is, precision is the fraction of events where we correctly declared $i$ out of all instances where the algorithm declared $i$. Conversely, recall is the fraction of events where we correctly declared $i$ out of all of the cases where the true state of the world is $i$.
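As a small sketch of those two formulas (assuming the matrix $M$ is stored with rows as the truth and columns as the declared class, as above; the counts are made up):

```python
import numpy as np

# hypothetical 3-class confusion matrix: rows = truth, columns = declared
M = np.array([[5, 2, 0],
              [1, 6, 1],
              [0, 3, 7]])

precision = M.diagonal() / M.sum(axis=0)  # divide by column sums: all instances declared i
recall    = M.diagonal() / M.sum(axis=1)  # divide by row sums: all instances truly i
print(precision, recall)
```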
The logic remains the same for several classes, to wit
If a document belonging to A…
- is classified as A, it's a true positive/true A
- is classified as B, it's a false positive for B/false B and a false negative for A
- is classified as C, it's a false positive for C/false C and a false negative for A
If a document belonging to B…
- is classified as A, it's a false positive for A/false A and a false negative for B
- is classified as B, it's a true positive for B
- is classified as C, it's a false positive for C/false C and a false negative for B
etc.
Precision for A is true positives/(true positives + false positives) where “false positives” are the false positives from all other classes (i.e. the B documents classified as A + the C documents classified as A, etc.).
Recall for A is true positives/(true positives + false negatives) where “false negatives” are all the A documents not classified as A (i.e. the A documents classified as B + the A documents classified as C, etc.) or, equivalently, the total number of A documents minus the number of true positives.
You can also look at all this as a series of confusion matrices with two categories: One with A and non-A (so B and C together), one with B and non-B and finally one with C and non-C.
Most informative is to report precision and recall for each category (especially if you have just a few), but I have seen people combine them into an F1 score and average across categories to obtain some sort of overall performance measure; a rough sketch of that is shown below.
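To illustrate the one-vs-rest view and the averaged F1 (a sketch only, reusing the hypothetical 3-class confusion matrix from above):

```python
import numpy as np

# rows = true class (A, B, C), columns = predicted class
M = np.array([[5, 2, 0],
              [1, 6, 1],
              [0, 3, 7]])

f1_scores = []
for i in range(M.shape[0]):        # one binary "class i vs. non-i" problem per class
    tp = M[i, i]
    fp = M[:, i].sum() - tp        # documents of other classes classified as i
    fn = M[i, :].sum() - tp        # class-i documents classified as something else
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_scores.append(2 * precision * recall / (precision + recall))

print(np.mean(f1_scores))          # macro-averaged F1 as a single overall measure
```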
Best Answer
For multi-label classification you have two ways to go about it. First consider the following.
Example based
The metrics are computed in a per-datapoint manner. For each datapoint, a score is computed from its predicted and true label sets, and these scores are then aggregated over all the datapoints.
There are other metrics as well.
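For instance, example-based precision and recall could be sketched like this (hypothetical label sets; each datapoint is scored by comparing its predicted and true label sets, then the scores are averaged):

```python
# hypothetical true and predicted label sets, one pair per datapoint
true_sets = [{"a", "b"}, {"b"},      {"a", "c"}]
pred_sets = [{"a"},      {"b", "c"}, {"a", "b", "c"}]

prec, rec = [], []
for t, p in zip(true_sets, pred_sets):
    inter = len(t & p)
    prec.append(inter / len(p) if p else 1.0)  # fraction of predicted labels that are correct
    rec.append(inter / len(t) if t else 1.0)   # fraction of true labels that were predicted

print(sum(prec) / len(prec))  # example-based precision
print(sum(rec) / len(rec))    # example-based recall
```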
Label based
Here things are done label-wise. For each label the metrics (e.g. precision, recall) are computed and then these label-wise metrics are aggregated. Hence, in this case you end up computing the precision/recall for each label over the entire dataset, as you would for a binary classification (since each label has a binary assignment), and then aggregate them.
The easy way is to present the general form.
This is just an extension of the standard multi-class equivalent.
Macro-averaged: $\frac{1}{q}\sum_{j=1}^{q}B(TP_{j},FP_{j},TN_{j},FN_{j})$
Micro-averaged: $B\left(\sum_{j=1}^{q}TP_{j},\sum_{j=1}^{q}FP_{j},\sum_{j=1}^{q}TN_{j},\sum_{j=1}^{q}FN_{j}\right)$
Here the $TP_{j},FP_{j},TN_{j},FN_{j}$ are the true positive, false positive, true negative and false negative counts respectively for only the $j^{th}$ label.
Here $B$ stands for any confusion-matrix based metric. In your case you would plug in the standard precision and recall formulas. For the macro average you apply the metric to each label's counts and then average over labels; for the micro average you sum the counts across labels first and then apply the metric once.
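A minimal sketch of that macro/micro distinction, using precision as the metric $B$ and hypothetical per-label counts:

```python
# hypothetical per-label counts for q = 3 labels
TP = [10, 2, 30]
FP = [ 5, 8,  2]

def precision(tp, fp):   # B: any confusion-matrix based metric would do
    return tp / (tp + fp)

# macro: apply B to each label's counts, then average the q values
macro = sum(precision(tp, fp) for tp, fp in zip(TP, FP)) / len(TP)

# micro: sum the counts over all labels first, then apply B once
micro = precision(sum(TP), sum(FP))

print(macro, micro)
```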
You might be interested to have a look at the code for the multi-label metrics here, which is part of the package mldr in R. Also you might be interested to look into the Java multi-label library MULAN.
This is a nice paper to get into the different metrics: A Review on Multi-Label Learning Algorithms