I am really confused about how to calculate precision and recall for a supervised machine learning algorithm, using a naive Bayes (NB) classifier with more than two classes.
For example:
- I have three classes: $A$, $B$, and $C$
- I have $10000$ documents, of which $2000$ go into the training sample set (class $A=500$, class $B=1000$, class $C=500$)
- On the basis of that training sample set, I classify the remaining $8000$ documents using the NB classifier
- After classification, $1000$ documents go to class $A$, $6000$ documents go to class $B$, and $1000$ documents go to class $C$
- How do I calculate precision and recall for each individual class?
I have figured out precision and recall for two classes. Here is how it goes:
Suppose there are two classes, $A$ and $B$.
When a test is executed for documents labeled as $A$, there are two possible classifications for each document: if the classification is $A$, add 1 to “true A” (TA); if the classification is $B$, add 1 to “false B” (FB). Similarly for documents labeled as $B$: if the classification is $A$, add 1 to “false A” (FA), and if the classification is $B$, add 1 to “true B” (TB).
I want the same procedure for the case where there are more than two classes.
Best Answer
The logic remains the same for several classes, to wit:
If a document belonging to A…
If a document belonging to B…
etc.
Precision for A is true positives/(true positives + false positives) where “false positives” are the false positives from all other classes (i.e. the B documents classified as A + the C documents classified as A, etc.).
Recall for A is true positives/(true positives + false negatives) where “false negatives” are all the A documents not classified as A (i.e. the A documents classified as B + the A documents classified as C, etc.) or, equivalently, the total number of A documents minus the number of true positives.
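As a minimal sketch of the two definitions above, the following computes precision and recall per class from a table of counts. The counts are hypothetical: they are invented for illustration and only the predicted totals ($1000$ for $A$, $6000$ for $B$, $1000$ for $C$) match the question. The names `counts` and `precision_recall` are likewise just illustrative.

```python
# Hypothetical counts[true_label][predicted_label], invented for illustration.
# Only the column totals (1000 A, 6000 B, 1000 C predictions) match the question.
counts = {
    "A": {"A": 800, "B": 1100, "C": 100},
    "B": {"A": 150, "B": 4500, "C": 100},
    "C": {"A": 50, "B": 400, "C": 800},
}


def precision_recall(counts, label):
    """Return (precision, recall) for one class from a nested count table."""
    true_pos = counts[label][label]
    # False positives: documents of other classes classified as `label`.
    false_pos = sum(counts[t][label] for t in counts if t != label)
    # False negatives: `label` documents classified as some other class.
    false_neg = sum(n for pred, n in counts[label].items() if pred != label)
    # Assumes the class was predicted at least once and occurs at least once,
    # otherwise these divisions would fail.
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


for label in counts:
    p, r = precision_recall(counts, label)
    print(f"{label}: precision={p:.3f} recall={r:.3f}")
```

With these made-up counts, class A has precision $800/1000 = 0.8$ and recall $800/2000 = 0.4$.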
You can also look at all this as a series of confusion matrices with two categories: One with A and non-A (so B and C together), one with B and non-B and finally one with C and non-C.
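The "series of two-category confusion matrices" view can be sketched as follows, again with hypothetical counts invented for illustration. The helper name `one_vs_rest` is an assumption, not something from the question; it collapses the three-class table into true/false positives and negatives for one class against all the others.

```python
# Hypothetical counts[true_label][predicted_label], invented for illustration.
counts = {
    "A": {"A": 800, "B": 1100, "C": 100},
    "B": {"A": 150, "B": 4500, "C": 100},
    "C": {"A": 50, "B": 400, "C": 800},
}


def one_vs_rest(counts, label):
    """Collapse a multi-class count table into a 2x2 table: `label` vs. the rest."""
    total = sum(sum(row.values()) for row in counts.values())
    tp = counts[label][label]
    # Other classes' documents that were classified as `label`.
    fp = sum(counts[t][label] for t in counts if t != label)
    # `label` documents that were classified as anything else.
    fn = sum(counts[label].values()) - tp
    # Everything that is neither `label` nor classified as `label`.
    tn = total - tp - fp - fn
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


for label in counts:
    print(label, one_vs_rest(counts, label))
```

Each 2x2 table can then be handled exactly as in the two-class case.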
It is most informative to report precision and recall for each category (especially if you have just a few), but I have seen people combine them into an F1 score and average across categories to obtain some sort of overall performance measure.
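One way to do that combination, as a sketch: compute F1 (the harmonic mean of precision and recall) for each class and take the unweighted mean, often called macro-averaging. The per-class values below are hypothetical, and the function names are illustrative. Other averaging schemes exist, so this is one option rather than the only one.

```python
# Hypothetical per-class (precision, recall) pairs, invented for illustration.
per_class = {"A": (0.80, 0.40), "B": (0.75, 0.95), "C": (0.80, 0.64)}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1_score(p, r) for p, r in per_class.values()]
    return sum(scores) / len(scores)


for label, (p, r) in per_class.items():
    print(f"{label}: F1={f1_score(p, r):.3f}")
print(f"macro F1={macro_f1(per_class):.3f}")
```

Because the mean is unweighted, a small class counts as much as a large one, which is worth keeping in mind when the classes are as unbalanced as in the question.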