Classification Metrics – Recall and Precision in Classification Explained

machine-learning, metric

I have read some definitions of recall and precision, but they always seem to be given in the context of information retrieval. I was wondering if someone could explain them a bit more in a classification context and maybe illustrate some examples. Say, for example, I have a binary classifier that gives me a precision of 60% and a recall of 95%: is this a good classifier?

To narrow down my goal a bit more: which of these would you say is the best classifier? (The dataset is imbalanced; the majority class has twice as many examples as the minority class.)

I'd personally say model 5, because of its area under the receiver operating characteristic (ROC) curve.

(As you can see in the results, model 8 has a low precision and a very high recall, but one of the lowest AUC-ROC values. Does that make it a good model or a bad one?)



Edit:

I have an Excel file with more information:
https://www.dropbox.com/s/6hq7ew5qpztwbo8/comparissoninbalance.xlsx

In this document you can find the area under the receiver operating characteristic curve and the area under the precision-recall curve, together with the plots.
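
(For reference, both areas can be computed directly from a model's scores; below is a minimal sketch with scikit-learn, where the arrays y_true and y_score are made-up stand-ins and not taken from the spreadsheet.)

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    # Hypothetical 0/1 labels and predicted scores standing in for one model.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)
    y_score = 0.3 * y_true + 0.7 * rng.random(1000)

    print("AUC-ROC:", roc_auc_score(y_true, y_score))            # area under the ROC curve
    print("AUC-PR: ", average_precision_score(y_true, y_score))  # summarizes the precision-recall curve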

Best Answer

Whether a classifier is “good” really depends on

  1. What else is available for your particular problem. Obviously, you want a classifier to be better than random or naive guesses (e.g. classifying everything as belonging to the most common category) but some things are easier to classify than others.
  2. The cost of different mistakes (false alarms vs. false negatives) and the base rate. It is very important to distinguish the two and to work out the consequences, because it is possible to have a classifier with very high accuracy (correct classifications on some test sample) that is completely useless in practice. Say you are trying to detect a rare disease or some uncommon mischievous behavior and plan to launch some action upon detection: large-scale testing costs something and the remedial action/treatment also typically involves significant risks or costs, so considering that most hits are going to be false positives, from a cost/benefit perspective it might be better to do nothing.

To understand the link between recall/precision on the one hand and sensitivity/specificity on the other hand, it's useful to come back to a confusion matrix:

                      Condition: A             Condition: Not A

  Test says “A”       True positive (TP)   |   False positive (FP)
                      ----------------------------------------------
  Test says “Not A”   False negative (FN)  |   True negative (TN)

Recall is TP/(TP + FN) whereas precision is TP/(TP+FP). This reflects the nature of the problem: In information retrieval, you want to identify as many relevant documents as you can (that's recall) and avoid having to sort through junk (that's precision).

Using the same table, traditional classification metrics are (1) sensitivity defined as TP/(TP + FN) and (2) specificity defined as TN/(FP + TN). So recall and sensitivity are simply synonymous but precision and specificity are defined differently (like recall and sensitivity, specificity is defined with respect to the column total whereas precision refers to the row total). Precision is also sometimes called the “positive predictive value” or, rarely, the “false positive rate” (but see my answer to Relation between true positive, false positive, false negative and true negative regarding the confusion surrounding this definition of the false positive rate).
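
To make these definitions concrete, here is a minimal sketch in Python; the counts tp, fp, fn and tn are made-up example numbers, not figures from the question.

    # Hypothetical confusion-matrix counts.
    tp, fp, fn, tn = 40, 10, 5, 945

    recall      = tp / (tp + fn)   # column total of "Condition: A"; identical to sensitivity
    precision   = tp / (tp + fp)   # row total of "Test says A"; the positive predictive value
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)   # column total of "Not A"

    print(f"recall/sensitivity = {recall:.3f}")       # 0.889
    print(f"precision          = {precision:.3f}")    # 0.800
    print(f"specificity        = {specificity:.3f}")  # 0.990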

Interestingly, information retrieval metrics do not involve the “true negative” count. This makes sense: In information retrieval, you don't care about correctly classifying negative instances per se, you just don't want too many of them polluting your results (see also Why doesn't recall take into account true negatives?).

Because of this difference, it's not possible to go from specificity to precision or the other way around without additional information, namely the number of true negatives or, alternatively, the overall proportion of positive and negative cases. However, for the same corpus/test set, higher specificity always means better precision so they are closely related.

In an information retrieval context, the goal is typically to identify a small number of matches from a large number of documents. Because of this asymmetry, it is in fact much more difficult to get a good precision than a good specificity while keeping the sensitivity/recall constant. Since most documents are irrelevant, you have many more occasions for false alarms than true positives and these false alarms can swamp the correct results even if the classifier has impressive accuracy on a balanced test set (this is in fact what's going on in the scenarios I mentioned in my point 2 above). Consequently, you really need to optimize precision and not merely to ensure decent specificity because even impressive-looking rates like 99% or more are sometimes not enough to avoid numerous false alarms.
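
To put numbers on both points, here is a small sketch; the helper precision_from_rates and all the figures are my own illustration rather than anything from the question. The same sensitivity and specificity yield very different precision depending on the base rate, and at a 1% base rate even 99% specificity leaves roughly half of all alarms false.

    def precision_from_rates(sensitivity, specificity, prevalence):
        """Precision implied by a given sensitivity, specificity and base rate."""
        expected_tp = sensitivity * prevalence              # true positives per case screened
        expected_fp = (1 - specificity) * (1 - prevalence)  # false positives per case screened
        return expected_tp / (expected_tp + expected_fp)

    # 95% sensitivity and 99% specificity:
    print(precision_from_rates(0.95, 0.99, 0.01))  # ~0.49 at a 1% base rate
    print(precision_from_rates(0.95, 0.99, 0.50))  # ~0.99 at a 50% base rate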

There is usually a trade-off between sensitivity and specificity (or recall and precision). Intuitively, if you cast a wider net, you will detect more relevant documents/positive cases (higher sensitivity/recall) but you will also get more false alarms (lower specificity and lower precision). If you classify everything in the positive category, you have 100% recall/sensitivity, a bad precision and a mostly useless classifier (“mostly” because if you don't have any other information, it is perfectly reasonable to assume it's not going to rain in a desert and to act accordingly so maybe the output is not useless after all; of course, you don't need a sophisticated model for that).
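
This trade-off is easy to see by sweeping the decision threshold of a probabilistic classifier; the sketch below uses simulated labels and scores (nothing here comes from the question's data).

    import numpy as np
    from sklearn.metrics import precision_score, recall_score

    rng = np.random.default_rng(1)
    y_true = rng.integers(0, 2, size=2000)
    # Simulated scores: positives tend to score higher than negatives.
    y_score = np.clip(0.5 * y_true + rng.normal(0.3, 0.2, size=2000), 0, 1)

    # Lowering the threshold casts a wider net: recall rises, precision falls.
    for threshold in (0.8, 0.5, 0.2):
        y_pred = (y_score >= threshold).astype(int)
        print(f"threshold={threshold:.1f}  "
              f"precision={precision_score(y_true, y_pred):.2f}  "
              f"recall={recall_score(y_true, y_pred):.2f}")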

Considering all this, 60% precision and 95% recall does not sound too bad but, again, this really depends on the domain and what you intend to do with this classifier.
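
As a quick back-of-the-envelope check (my own arithmetic, not part of the original answer): combining the two numbers into an F1 score, the harmonic mean of precision and recall, gives roughly 0.74.

    precision, recall = 0.60, 0.95
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    print(f"F1 = {f1:.3f}")  # 0.735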


Some additional information regarding the latest comments/edits:

Again, the performance you can expect depends on the specifics (in this context these would be things like the exact set of emotions present in the training set, the quality of the picture/video, luminosity, occlusion, head movements, acted or spontaneous videos, a person-dependent or person-independent model, etc.), but an F1 above 0.7 sounds good for this type of application, even if the very best models can do better on some data sets [see Valstar, M. F., Mehu, M., Jiang, B., Pantic, M., & Scherer, K. (2012). Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4), 966-979].

Whether such a model is useful in practice is a completely different question and obviously depends on the application. Note that facial “expression” is itself a complex topic and going from a typical training set (posed expressions) to any real-life situation is not easy. This is rather off-topic on this forum but it will have serious consequences for any practical application you might contemplate.

Finally, head-to-head comparison between models is yet another question. My take on the numbers you presented is that there isn't any dramatic difference between the models (if you refer to the paper cited above, the range of F1 scores for well-known models in this area is much broader). In practice, technical aspects (simplicity/availability of standard libraries, speed of the different techniques, etc.) would likely decide which model gets implemented, except perhaps if the costs/benefits and the base rate make you strongly favor either precision or recall.
