Classification Metrics – Recall and Precision in Classification Explained

machine-learning, metric

I have read some definitions of recall and precision, but they always seem to be given in the context of information retrieval. I was wondering if someone could explain them a bit more in a classification context and maybe illustrate some examples. Say, for example, I have a binary classifier that gives me a precision of 60% and a recall of 95%: is this a good classifier?

To narrow down my goal a bit more: which of these would you say is the best classifier? (The dataset is imbalanced; the majority class has twice as many examples as the minority class.)

I'd personally say model 5, because of its area under the receiver operating characteristic (ROC) curve.

(As you can see in the results, model 8 has a low precision and a very high recall, but one of the lowest AUC-ROC values. Does that make it a good model or a bad one?)



Edit:

I have an Excel file with more information:
https://www.dropbox.com/s/6hq7ew5qpztwbo8/comparissoninbalance.xlsx

In this document you can find the area under the receiver operating characteristic curve and the area under the precision-recall curve, together with the plots.
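
(For reference, both areas can be computed directly from a model's scores; below is a minimal sketch with scikit-learn, where the arrays y_true and y_score are made-up stand-ins and not taken from the spreadsheet.)

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    # Hypothetical 0/1 labels and predicted scores standing in for one model.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)
    y_score = 0.3 * y_true + 0.7 * rng.random(1000)

    print("AUC-ROC:", roc_auc_score(y_true, y_score))            # area under the ROC curve
    print("AUC-PR: ", average_precision_score(y_true, y_score))  # summarizes the precision-recall curve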

Best Answer

Whether a classifier is “good” really depends on

  1. What else is available for your particular problem. Obviously, you want a classifier to be better than random or naive guesses (e.g. classifying everything as belonging to the most common category) but some things are easier to classify than others.
  2. The cost of different mistakes (false alarms vs. false negatives) and the base rate. It is very important to distinguish the two and to work out the consequences, because it is possible to have a classifier with very high accuracy (correct classifications on some test sample) that is completely useless in practice. Say you are trying to detect a rare disease or some uncommon mischievous behavior and plan to launch some action upon detection: large-scale testing costs something and the remedial action/treatment also typically involves significant risks or costs, so considering that most hits are going to be false positives, from a cost/benefit perspective it might be better to do nothing.

To understand the link between recall/precision on the one hand and sensitivity/specificity on the other hand, it's useful to come back to a confusion matrix:

                      Condition: A             Condition: Not A

  Test says “A”       True positive (TP)   |   False positive (FP)
                      ----------------------------------------------
  Test says “Not A”   False negative (FN)  |   True negative (TN)

Recall is TP/(TP + FN) whereas precision is TP/(TP+FP). This reflects the nature of the problem: In information retrieval, you want to identify as many relevant documents as you can (that's recall) and avoid having to sort through junk (that's precision).

Using the same table, traditional classification metrics are (1) sensitivity defined as TP/(TP + FN) and (2) specificity defined as TN/(FP + TN). So recall and sensitivity are simply synonymous but precision and specificity are defined differently (like recall and sensitivity, specificity is defined with respect to the column total whereas precision refers to the row total). Precision is also sometimes called the “positive predictive value” or, rarely, the “false positive rate” (but see my answer to Relation between true positive, false positive, false negative and true negative regarding the confusion surrounding this definition of the false positive rate).
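
To make these definitions concrete, here is a minimal sketch in Python; the counts tp, fp, fn and tn are made-up example numbers, not figures from the question.

    # Hypothetical confusion-matrix counts.
    tp, fp, fn, tn = 40, 10, 5, 945

    recall      = tp / (tp + fn)   # column total of "Condition: A"; identical to sensitivity
    precision   = tp / (tp + fp)   # row total of "Test says A"; the positive predictive value
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)   # column total of "Not A"

    print(f"recall/sensitivity = {recall:.3f}")       # 0.889
    print(f"precision          = {precision:.3f}")    # 0.800
    print(f"specificity        = {specificity:.3f}")  # 0.990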

Interestingly, information retrieval metrics do not involve the “true negative” count. This makes sense: In information retrieval, you don't care about correctly classifying negative instances per se, you just don't want too many of them polluting your results (see also Why doesn't recall take into account true negatives?).

Because of this difference, it's not possible to go from specificity to precision or the other way around without additional information, namely the number of true negatives or, alternatively, the overall proportion of positive and negative cases. However, for the same corpus/test set, higher specificity always means better precision so they are closely related.

In an information retrieval context, the goal is typically to identify a small number of matches from a large number of documents. Because of this asymmetry, it is in fact much more difficult to get a good precision than a good specificity while keeping the sensitivity/recall constant. Since most documents are irrelevant, you have many more occasions for false alarms than true positives and these false alarms can swamp the correct results even if the classifier has impressive accuracy on a balanced test set (this is in fact what's going on in the scenarios I mentioned in my point 2 above). Consequently, you really need to optimize precision and not merely to ensure decent specificity because even impressive-looking rates like 99% or more are sometimes not enough to avoid numerous false alarms.
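
To put numbers on both points, here is a small sketch; the helper precision_from_rates and all the figures are my own illustration rather than anything from the question. The same sensitivity and specificity yield very different precision depending on the base rate, and at a 1% base rate even 99% specificity leaves roughly half of all alarms false.

    def precision_from_rates(sensitivity, specificity, prevalence):
        """Precision implied by a given sensitivity, specificity and base rate."""
        expected_tp = sensitivity * prevalence              # true positives per case screened
        expected_fp = (1 - specificity) * (1 - prevalence)  # false positives per case screened
        return expected_tp / (expected_tp + expected_fp)

    # 95% sensitivity and 99% specificity:
    print(precision_from_rates(0.95, 0.99, 0.01))  # ~0.49 at a 1% base rate
    print(precision_from_rates(0.95, 0.99, 0.50))  # ~0.99 at a 50% base rate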

There is usually a trade-off between sensitivity and specificity (or recall and precision). Intuitively, if you cast a wider net, you will detect more relevant documents/positive cases (higher sensitivity/recall) but you will also get more false alarms (lower specificity and lower precision). If you classify everything in the positive category, you have 100% recall/sensitivity, a bad precision and a mostly useless classifier (“mostly” because if you don't have any other information, it is perfectly reasonable to assume it's not going to rain in a desert and to act accordingly so maybe the output is not useless after all; of course, you don't need a sophisticated model for that).
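
This trade-off is easy to see by sweeping the decision threshold of a probabilistic classifier; the sketch below uses simulated labels and scores (nothing here comes from the question's data).

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=2000)
    # Simulated scores: positives tend to score higher than negatives.
    y_score = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=2000), 0, 1)

    # Lowering the threshold casts a wider net: recall rises, precision falls.
    for threshold in (0.8, 0.5, 0.2):
        y_pred = (y_score >= threshold).astype(int)
        print(f"threshold={threshold:.1f}  "
              f"precision={precision_score(y_true, y_pred):.2f}  "
              f"recall={recall_score(y_true, y_pred):.2f}")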

Considering all this, 60% precision and 95% recall does not sound too bad but, again, this really depends on the domain and what you intend to do with this classifier.
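
As a quick back-of-the-envelope check (my own arithmetic, not part of the original answer): combining the two numbers into an F1 score, the harmonic mean of precision and recall, gives roughly 0.74.

    precision, recall = 0.60, 0.95
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    print(f"F1 = {f1:.3f}")  # 0.735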


Some additional information regarding the latest comments/edits:

Again, the performance you can expect depends on the specifics (in this context these would be things like the exact set of emotions present in the training set, the quality of the picture/video, luminosity, occlusion, head movements, acted or spontaneous videos, a person-dependent or person-independent model, etc.), but an F1 above 0.7 sounds good for this type of application, even if the very best models can do better on some data sets [see Valstar, M. F., Mehu, M., Jiang, B., Pantic, M., & Scherer, K. (2012). Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4), 966-979].

Whether such a model is useful in practice is a completely different question and obviously depends on the application. Note that facial “expression” is itself a complex topic and going from a typical training set (posed expressions) to any real-life situation is not easy. This is rather off-topic on this forum but it will have serious consequences for any practical application you might contemplate.

Finally, head-to-head comparison between models is yet another question. My take on the numbers you presented is that there isn't any dramatic difference between the models (if you refer to the paper cited above, the range of F1 scores for well-known models in this area is much broader). In practice, technical aspects (simplicity/availability of standard libraries, speed of the different techniques, etc.) would likely decide which model gets implemented, except perhaps if the costs/benefits and the base rate make you strongly favor either precision or recall.
