Solved – Interpreting precision/recall results from a LogisticRegression

machine-learning, precision-recall, regression, scikit-learn

I computed a word-vector model on medical reports about a critical disease and ran a logistic regression as a binary classifier. The text data is labeled with 1 = successful and 0 = unsuccessful for the true outcome of the treatment. I train on 90% of my data and test on the remaining 10%. The dataset contains about 30% successful and 70% unsuccessful cases (N ≈ 20,000).
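(One common way to produce such a split is sketched below; the variable names x and y are assumptions, and the question does not say whether the original split was stratified.)

from sklearn.model_selection import train_test_split

# Hypothetical feature matrix x and label vector y; stratify=y keeps the
# ~30/70 class balance identical in the 90% train and 10% test parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.10, stratify=y, random_state=0)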

Now, I got the following results from scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

LogR = LogisticRegression(penalty='l2', C=1.0, max_iter=100, n_jobs=1, tol=0.0001)
LogR.fit(x_train, y_train)
y_pred = LogR.predict(x_test)
print(classification_report(y_test, y_pred, digits=4))

             precision    recall  f1-score   support

        0.0     0.6197    0.9543    0.7514      3413
        1.0     0.3305    0.0371    0.0667      2076

avg / total     0.5103    0.6074    0.4924      5489
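To connect these numbers to their definitions, here is a minimal sketch (reusing y_test and y_pred from the snippet above) that rebuilds the class-1 precision and recall from the raw confusion matrix:

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns tn, fp, fn, tp in this order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision_1 = tp / (tp + fp)  # share of predicted successes that are real
recall_1 = tp / (tp + fn)     # share of real successes that were found
print(precision_1, recall_1)  # should match the 0.3305 and 0.0371 above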

I would like to make sure I interpret these results correctly, and in particular what they imply in practice. Note that I have looked into the theory of the precision/recall trade-off, the basic definitions of accuracy, precision, and recall, and the F-score weighting.

As I interpret the results,

  • Overall accuracy (60.7%) is only a little better than flipping a coin, and in fact worse than always predicting "unsuccessful" (3413/5489 ≈ 62%); the averaged F-score suggests an even bleaker picture.
  • The model is moderately precise when classifying unsuccessful treatments ("0", 62%), and its 95% recall on that class means it misses almost none of the unsuccessful treatments in the whole test sample.
  • However, the model is almost unable to identify successful cases ("1"). It finds only about 4% of the actual successes (recall 0.0371), and even among the cases it does flag as successful, only about 33% are truly successful (precision 0.3305), i.e. it includes many false positives.

Assuming you were a doctor, would these propositions be correct:

  • Intuition for high precision, low recall: most treatments predicted as successful really are successful, at the risk of missing many true successes (false negatives)?
  • Intuition for low precision, high recall: most of the truly successful treatments are captured, at the risk of including many false positives (i.e. a fair chance that a predicted successful outcome turns out to be unsuccessful in truth)?

It would be great to get a bit of feedback on, or corrections to, my interpretations.
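(To make the trade-off behind the two intuitions above concrete, here is a minimal sketch that sweeps the decision threshold on the predicted probabilities; it reuses the fitted LogR from the question, and the threshold values are arbitrary illustrations.)

from sklearn.metrics import precision_score, recall_score

proba = LogR.predict_proba(x_test)[:, 1]  # predicted P(success) per report
for t in (0.3, 0.5, 0.7):
    pred = (proba >= t).astype(int)
    print(t, precision_score(y_test, pred, zero_division=0),
          recall_score(y_test, pred, zero_division=0))
# Lowering the threshold raises recall (fewer missed successes) at the
# cost of precision (more false positives); raising it does the opposite.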

Best Answer

The logistic model's job is direct probability estimation. Don't use any accuracy measure that requires categorizing the estimated probabilities. The $c$-index (concordance probability; AUROC) can help, but it supplements rather than replaces measures based on the predicted probabilities and the log-likelihood. It would also be advisable to read about optimal decision making, which couples the predicted probabilities with a utility function and does not pre-categorize them; this amounts to minimizing expected loss/cost. The ROC curve doesn't help either, as it invites the analyst to choose a cutpoint that is divorced from the actual utility function.
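(As a hedged illustration of this advice, the sketch below evaluates the predicted probabilities directly via log-loss, Brier score, and the c-index, and derives a cutoff, if one is needed at all, from an assumed cost ratio rather than the default 0.5; the 9:1 cost ratio is purely hypothetical.)

from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score

proba = LogR.predict_proba(x_test)[:, 1]

print("log-loss:", log_loss(y_test, proba))       # log-likelihood based
print("Brier:", brier_score_loss(y_test, proba))  # squared error of the probabilities
print("c-index:", roc_auc_score(y_test, proba))   # concordance probability / AUROC

# If missing a successful treatment were judged, say, 9 times as costly
# as a false alarm (a hypothetical utility), the expected-loss-minimizing
# rule would act whenever P(success) exceeds 1 / (1 + 9) = 0.1.
cost_ratio = 9.0
act = proba > 1.0 / (1.0 + cost_ratio)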
