Solved – Precision vs Recall acceptable Limits

classification, machine learning, precision-recall

I have a few questions about precision and recall in machine-learning classification.

While reading, I found that high precision tends to come with low recall, and vice versa.

1. If someone asks what percentage of precision is acceptable, what could the answer be?
2. Should (or can) recall be greater than precision? Or should (or can) precision be greater than recall?
3. If we focus on precision, should it be 95%? Or if we focus on recall, should it be above 95%?
4. In my test results I got 100% recall. Can recall or precision be greater than 100%?

Best Answer

To explain this, I will use an example. I trained a model that classifies fruit as banana or not banana. To evaluate the model, I use 20 pieces of fruit: 10 bananas and 10 other fruits. 8 bananas were classified as bananas, and the other 2 as "other fruits". 7 pieces of fruit that aren't bananas were classified as "other fruits", and the other 3 as bananas.

If "banana" is our positive class and "other fruits" our negative class, we will have 8 True Positives, 2 False Negatives, 7 True Negatives and 3 False Positives.

Given this, we can calculate precision as True Positives / (True Positives + False Positives), that is, the proportion of samples predicted as positive that really are positive. Recall is True Positives / (True Positives + False Negatives), that is, the proportion of positive samples that were identified as positive.
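As a minimal sketch, the two formulas can be applied to the banana counts above. The function name `precision_recall` and the choice to return 0.0 when a denominator is empty are my own assumptions for illustration, not part of any library.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts."""
    # Assumption: report 0.0 when a denominator is empty (no predicted
    # positives, or no actual positives) instead of raising an error.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Banana example: 8 true positives, 3 false positives, 2 false negatives.
precision, recall = precision_recall(tp=8, fp=3, fn=2)
print(f"precision = {precision:.3f}")  # 8 / 11
print(f"recall    = {recall:.3f}")     # 8 / 10
```

Both values are ratios of a count to a total that includes that count, which is why neither can exceed 1.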

So we can now answer the 4th question: precision and recall can never be greater than 1 (100%), because the denominator of each formula is always greater than or equal to the numerator.

Question 1 is difficult to answer because it depends on the problem. In many cases you will need precision or recall greater than 0.9 to be confident that you are predicting correctly, but in other cases lower values are acceptable. This question doesn't have a single answer.

Question 2: again, it depends. In some cases you will want high precision even if that means lower recall, because you want all of your positive predictions to be truly positive (think of a system that predicts whether a person has cancer: you don't want to give chemo to a healthy person). On the other hand, in some cases you will want high recall, because it means that almost no positive sample is classified as negative (think of a computer virus detector: it is better to flag a harmless file as dangerous than to let a virus infect the computer).

Question 3: same as the two previous questions; it depends on the problem.

Related Solutions

Solved – How to generalize precision, recall and F-score to non-classification problem

OK, so your question is: what evaluation metrics should one use for collaborative filtering (CF)? This is a continuous-valued response (unless you've dichotomized it, which some people do recommend), so you want to use some kind of residual-based loss function such as MSE. The Netflix Prize competition used root MSE (RMSE).

There is a lot of material on the evaluation of CF. You might also watch Andrew Ng's lectures on CF; see under XVI. Recommender Systems.

Also, make sure that your MSE computation is working correctly: make a fake recommender that peeks at the validation set to produce correct responses at least 10%, 30%, 50%, 70% of the time, and check that the MSE drops as that percentage rises.
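Here is a rough sketch of that sanity check. Everything in it is made up for illustration: the ratings are synthetic integers from 1 to 5, and `peeking_recommender` is a hypothetical fake recommender that returns the true rating with probability `peek_rate` and a random rating otherwise.

```python
import random


def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def peeking_recommender(actual, peek_rate, rng):
    """Hypothetical fake recommender: with probability peek_rate it copies
    the true rating, otherwise it guesses a random rating from 1 to 5."""
    return [a if rng.random() < peek_rate else rng.randint(1, 5) for a in actual]


rng = random.Random(0)  # fixed seed so the run is repeatable
ratings = [rng.randint(1, 5) for _ in range(10_000)]  # synthetic validation set

results = {}
for peek_rate in (0.1, 0.3, 0.5, 0.7):
    predictions = peeking_recommender(ratings, peek_rate, rng)
    results[peek_rate] = mse(predictions, ratings)
    print(f"peek rate {peek_rate:.0%}: MSE = {results[peek_rate]:.3f}")
```

If the printed MSE does not fall as the peek rate rises, the MSE computation (or the way predictions are aligned with the validation set) is probably wrong.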

Classification Metrics – Recall and Precision in Classification Explained

Whether a classifier is “good” really depends on:

1. What else is available for your particular problem. Obviously, you want a classifier to be better than random or naive guesses (e.g. classifying everything as belonging to the most common category), but some things are easier to classify than others.
2. The cost of different mistakes (false alarms vs. false negatives) and the base rate. It's very important to distinguish the two and work out the consequences, because it's possible to have a classifier with very high accuracy (correct classifications on some test sample) that is completely useless in practice. Say you are trying to detect a rare disease or some uncommon mischievous behavior and plan to launch some action upon detection. Large-scale testing costs something, and the remedial action/treatment also typically involves significant risks/costs, so considering that most hits are going to be false positives, from a cost/benefit perspective it might be better to do nothing.
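To illustrate how high accuracy can coexist with a useless classifier, here is a small sketch with invented numbers: a population of 10,000 people of whom 10 have the disease, and a naive classifier that labels everyone as healthy.

```python
# Hypothetical screening population (numbers invented for illustration).
n_total = 10_000
n_sick = 10
n_healthy = n_total - n_sick

# Naive classifier: always predict "healthy", i.e. the most common category.
tp, fn = 0, n_sick      # every sick person is missed
tn, fp = n_healthy, 0   # every healthy person is (trivially) correct

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.1%}")
print(f"recall   = {recall:.0%}")
```

The accuracy is 99.9% simply because the disease is rare, while the recall is 0%: the classifier never finds a single case.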

To understand the link between recall/precision on the one hand and sensitivity/specificity on the other hand, it's useful to come back to a confusion matrix:

                      Condition: A             Not A

  Test says “A”       True positive (TP)   |   False positive (FP)
                      ----------------------------------
  Test says “Not A”   False negative (FN)  |    True negative (TN)

Recall is TP/(TP + FN) whereas precision is TP/(TP+FP). This reflects the nature of the problem: In information retrieval, you want to identify as many relevant documents as you can (that's recall) and avoid having to sort through junk (that's precision).

Using the same table, traditional classification metrics are (1) sensitivity, defined as TP/(TP + FN), and (2) specificity, defined as TN/(FP + TN). So recall and sensitivity are simply synonymous but precision and specificity are defined differently (like recall and sensitivity, specificity is defined with respect to the column total whereas precision refers to the row total). Precision is also sometimes called the “positive predictive value” or, rarely, the “false positive rate” (but see my answer to “Relation between true positive, false positive, false negative and true negative” regarding the confusion surrounding this definition of the false positive rate).
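A short sketch may help keep the four definitions apart. The `ConfusionMatrix` class and the counts below are hypothetical, chosen only to show which cells each metric uses.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: int  # test says "A", condition is A
    fp: int  # test says "A", condition is not A
    fn: int  # test says "Not A", condition is A
    tn: int  # test says "Not A", condition is not A

    def recall(self) -> float:
        # Column total of "Condition: A"
        return self.tp / (self.tp + self.fn)

    def sensitivity(self) -> float:
        # Same quantity as recall, under its other name
        return self.recall()

    def specificity(self) -> float:
        # Column total of "Condition: Not A"
        return self.tn / (self.fp + self.tn)

    def precision(self) -> float:
        # Row total of 'Test says "A"'
        return self.tp / (self.tp + self.fp)


# Hypothetical counts, for illustration only.
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=130)
print(f"recall / sensitivity = {cm.recall():.3f}")
print(f"specificity          = {cm.specificity():.3f}")
print(f"precision            = {cm.precision():.3f}")
```

Note that `precision` never touches `tn`, while `specificity` never touches `tp`.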

Interestingly, information retrieval metrics do not involve the “true negative” count. This makes sense: In information retrieval, you don't care about correctly classifying negative instances per se, you just don't want too many of them polluting your results (see also “Why doesn't recall take into account true negatives?”).

Because of this difference, it's not possible to go from specificity to precision or the other way around without additional information, namely the number of true negatives or, alternatively, the overall proportion of positive and negative cases. However, for the same corpus/test set and the same recall, higher specificity always means better precision, so the two are closely related.

In an information retrieval context, the goal is typically to identify a small number of matches from a large number of documents. Because of this asymmetry, it is in fact much more difficult to get a good precision than a good specificity while keeping the sensitivity/recall constant. Since most documents are irrelevant, you have many more occasions for false alarms than true positives and these false alarms can swamp the correct results even if the classifier has impressive accuracy on a balanced test set (this is in fact what's going on in the scenarios I mentioned in my point 2 above). Consequently, you really need to optimize precision and not merely to ensure decent specificity because even impressive-looking rates like 99% or more are sometimes not enough to avoid numerous false alarms.
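To make this concrete, here is a sketch that derives precision from sensitivity, specificity and the proportion of positive cases (the prevalence). The function name and the example rates are assumptions chosen for illustration.

```python
def precision_from_rates(sensitivity: float, specificity: float,
                         prevalence: float) -> float:
    """Precision implied by sensitivity, specificity and prevalence.

    Per unit of population:
      true positives  = sensitivity * prevalence
      false positives = (1 - specificity) * (1 - prevalence)
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same classifier (99% sensitivity, 99% specificity), two hypothetical corpora.
balanced = precision_from_rates(0.99, 0.99, prevalence=0.5)
rare = precision_from_rates(0.99, 0.99, prevalence=0.001)

print(f"balanced test set (50% relevant): precision = {balanced:.3f}")
print(f"rare matches (0.1% relevant):     precision = {rare:.3f}")
```

With one relevant document in a thousand, the same 99% specificity gives a precision of roughly 9%, so about ten out of eleven hits are false alarms.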

There is usually a trade-off between sensitivity and specificity (or recall and precision). Intuitively, if you cast a wider net, you will detect more relevant documents/positive cases (higher sensitivity/recall) but you will also get more false alarms (lower specificity and lower precision). If you classify everything in the positive category, you have 100% recall/sensitivity, a bad precision and a mostly useless classifier (“mostly” because if you don't have any other information, it is perfectly reasonable to assume it's not going to rain in a desert and to act accordingly so maybe the output is not useless after all; of course, you don't need a sophisticated model for that).
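The "wider net" idea can be sketched by moving a decision threshold over scored examples. The scores and labels below are invented, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Hypothetical (score, true_label) pairs; higher score = more likely positive.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]


def precision_recall_at(threshold, scored):
    """Classify as positive when score >= threshold, then score the result."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    # Assumption: report 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Lowering the threshold casts a wider net.
for threshold in (0.85, 0.45, 0.0):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold {threshold:.2f}: precision = {p:.2f}, recall = {r:.2f}")
```

On this toy data, the strictest threshold gives precision 1.00 with recall 0.40, and a threshold of 0.0 (everything classified as positive) gives recall 1.00 with precision 0.50, which is just the share of positives in the data.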

Considering all this, 60% precision and 95% recall does not sound too bad but, again, this really depends on the domain and what you intend to do with this classifier.

Some additional information regarding the latest comments/edits:

Again, the performance you can expect depends on the specifics (in this context this would be things like the exact set of emotions present in the training set, quality of the picture/video, luminosity, occlusion, head movements, acted or spontaneous videos, person-dependent or person-independent model, etc.) but F1 over .7 sounds good for this type of application even if the very best models can do better on some data sets [see Valstar, M.F., Mehu, M., Jiang, B., Pantic, M., & Scherer, K. (2012). Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42 (4), 966-979].

Whether such a model is useful in practice is a completely different question and obviously depends on the application. Note that facial “expression” is itself a complex topic and going from a typical training set (posed expressions) to any real-life situation is not easy. This is rather off-topic on this forum but it will have serious consequences for any practical application you might contemplate.

Finally, head-to-head comparison between models is yet another question. My take on the numbers you presented is that there isn't any dramatic difference between the models (if you refer to the paper I cited above, the range of F1 scores for well-known models in this area is much broader). In practice, technical aspects (simplicity/availability of standard libraries, speed of the different techniques, etc.) would likely decide which model is implemented, except perhaps if the cost/benefits and overall rate make you strongly favor either precision or recall.
