Solved – Understanding Precision and Recall Results on a Binary Classifier

classification, machine-learning, precision-recall, python

I know the difference between the Precision and Recall metrics in Machine Learning: one penalizes False Positives and the other penalizes False Negatives. In statistics this corresponds to controlling Type I versus Type II error.

However, I am a bit confused about the circumstances under which one can get completely opposite Precision and Recall, such as Precision = 1 and Recall = 0.

To reiterate:

precision = true positives / (true positives + false positives)

recall = true positives / (true positives + false negatives)

And here is the Confusion Matrix

            predicted
            (+)  (-)
           -----------
       (+) | TP | FN |
actual     -----------
       (-) | FP | TN |
           -----------
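
For concreteness, here is a minimal sketch of both definitions in plain Python; the counts are hypothetical and chosen only to show that the two metrics can diverge sharply:

# Precision and recall computed directly from the confusion-matrix counts above.
def precision(tp, fp):
    # Fraction of predicted positives that are actually positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Fraction of actual positives that were predicted positive.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 5 true positives, 0 false positives, 95 false negatives.
tp, fp, fn = 5, 0, 95
print(precision(tp, fp))  # 1.0  -> no false positives at all
print(recall(tp, fn))     # 0.05 -> yet most actual positives were missed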

Now, if Precision = 1 for the positive (1) class of a classifier, that means there are no False Positives and every predicted positive label is a True Positive.

Then how can Recall be 0 for the same positive class? If some True Positives are being predicted (indeed, per the Precision of 1, every predicted positive is a True Positive), then the numerator of Recall is non-zero. Under what circumstances, then, can Recall be 0 for the positive class of the same classifier?

To give some context, I ran a logistic regression classifier on a binary classification problem. I had about 23K training samples with 774 features, 770 of which are binary (dummy) variables.

And this is the distribution of my class labels:

1    20429
0    12559

And here are the scores, accuracy, and classification metrics after a 5-fold grid search over some 25 combinations of hyperparameter values.

The mean train scores are [ 0.66883049  0.54314532  0.67008959  0.63187226  0.63100366  0.53165968
  0.54131812  0.55507725  0.5578254   0.57663273  0.57247462  0.57230056
  0.54402055  0.5762753   0.50925733  0.45781882  0.39366017  0.39037968
  0.3919818   0.38878762  0.39784982  0.39506755  0.48238147  0.38932944
  0.39801223]

The mean validation scores are [ 0.66445801  0.54107661  0.66878871  0.63184791  0.6305487   0.5291239
  0.53899788  0.55324585  0.55822615  0.57784418  0.57269066  0.57312373
  0.54536399  0.57593868  0.50790351  0.45727773  0.39318349  0.38906933
  0.39214413  0.38924256  0.39794725  0.39461262  0.4827855   0.38811658
  0.39812048]

The score on held out data is: 0.6687887055562773
 Hyper-Parameters for Best Score : {'alpha': 0.0001, 'l1_ratio': 0.45}

The accuracy of sgd on test data is: 0.37526523188845107

Classification Metrics for sgd :
             precision    recall  f1-score   support

          0       0.38      1.00      0.55      3712
          1       1.00      0.00      0.00      6185

avg / total       0.77      0.38      0.21      9897
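
For reference, since the original code is not shown, the setup described above roughly corresponds to a sketch like the following. The estimator (scikit-learn's SGDClassifier with logistic loss and an elastic-net penalty), the synthetic data, and the exact parameter grid are assumptions, chosen because 'alpha' and 'l1_ratio' are the usual hyperparameters for that model and 25 combinations were searched:

# Rough sketch of the described setup, assuming scikit-learn's SGDClassifier
# (logistic loss + elastic-net penalty) tuned with a 5-fold grid search.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data with the same shape as described (binary labels, 774 features).
X, y = make_classification(n_samples=23000, n_features=774, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Hypothetical 5 x 5 = 25 hyperparameter combinations.
param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "l1_ratio": [0.15, 0.30, 0.45, 0.60, 0.75],
}

grid = GridSearchCV(
    SGDClassifier(loss="log_loss", penalty="elasticnet", max_iter=1000),  # loss="log" in older scikit-learn
    param_grid,
    cv=5,
    return_train_score=True,
)
grid.fit(X_train, y_train)

print(grid.cv_results_["mean_train_score"])   # mean train scores per combination
print(grid.cv_results_["mean_test_score"])    # mean validation scores per combination
print(grid.best_params_)                      # best hyperparameters found
print(grid.score(X_test, y_test))             # score on held-out data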

Best Answer

Then how can Recall be 0 for the same positive class? If some True Positives are being predicted (indeed, per the Precision of 1, every predicted positive is a True Positive), then the numerator of Recall is non-zero. Under what circumstances, then, can Recall be 0 for the positive class of the same classifier?

Rounding. If the classifier predicts only a tiny handful of samples as class 1, with zero false positives among them, then precision for class 1 is exactly 1 while recall is on the order of 1/6185 ≈ 0.0002, which rounds to 0.00 at the two decimal places shown. Accuracy is then roughly 3712/9897 ≈ 0.375, matching the reported test accuracy: the classifier is labelling almost everything as class 0.
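
To see the rounding concretely, here is a small sketch with hypothetical counts that are consistent with the report above (all 3712 class-0 samples predicted correctly, and only two samples predicted as class 1, both correctly):

# Hypothetical counts consistent with the classification report above.
tp_1, fp_1, fn_1 = 2, 0, 6183          # class 1: 6185 actual positives, only 2 predicted
tn, total = 3712, 9897                 # class 0: all 3712 correctly predicted

precision_1 = tp_1 / (tp_1 + fp_1)     # 1.0      -> prints as 1.00
recall_1 = tp_1 / (tp_1 + fn_1)        # ~0.00032 -> prints as 0.00
accuracy = (tp_1 + tn) / total         # ~0.3753

print(f"precision={precision_1:.2f}  recall={recall_1:.2f}  accuracy={accuracy:.4f}")
# precision=1.00  recall=0.00  accuracy=0.3753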
