Solved – Interpreting precision/recall results from a LogisticRegression

machine-learning, precision-recall, regression, scikit-learn

I computed a word-vector model on medical reports about a critical disease and ran a logistic regression as a binary classifier. The text data is labeled with 1 = successful and 0 = unsuccessful for the true outcome of the treatment. I train on 90% of my data and test on the remaining 10%. The dataset contains about 30% successful and 70% unsuccessful cases (N ≈ 20,000).
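(One common way to produce such a split is sketched below; the variable names x and y are assumptions, and the question does not say whether the original split was stratified.)

from sklearn.model_selection import train_test_split

# Hypothetical feature matrix x and label vector y; stratify=y keeps the
# ~30/70 class balance identical in the 90% train and 10% test parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.10, stratify=y, random_state=0)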

Now, I got the following results from scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

LogR = LogisticRegression(penalty='l2', C=1.0, max_iter=100, n_jobs=1, tol=0.0001)
LogR.fit(x_train, y_train)
y_pred = LogR.predict(x_test)
print(classification_report(y_test, y_pred, digits=4))

             precision    recall  f1-score   support

        0.0     0.6197    0.9543    0.7514      3413
        1.0     0.3305    0.0371    0.0667      2076

avg / total     0.5103    0.6074    0.4924      5489
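To connect these numbers to their definitions, here is a minimal sketch (reusing y_test and y_pred from the snippet above) that rebuilds the class-1 precision and recall from the raw confusion matrix:

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns tn, fp, fn, tp in this order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision_1 = tp / (tp + fp)  # share of predicted successes that are real
recall_1 = tp / (tp + fn)     # share of real successes that were found
print(precision_1, recall_1)  # should match the 0.3305 and 0.0371 above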

I would like to make sure I interpret these results correctly, and in particular what they imply in practice. Note that I have looked into the theory of the precision/recall trade-off, the basic definitions of accuracy, precision, and recall, and the F-score weighting.

As I interpret the results,

  • Overall accuracy (60.7%) is only a little better than flipping a coin, and in fact worse than always predicting "unsuccessful" (3413/5489 ≈ 62%); the averaged F-score suggests an even bleaker picture.
  • The model is moderately precise when classifying unsuccessful treatments ("0", 62%), and its 95% recall on that class means it misses almost none of the unsuccessful treatments in the whole test sample.
  • However, the model is almost unable to identify successful cases ("1"). It finds only about 4% of the actual successes (recall 0.0371), and even among the cases it does flag as successful, only about 33% are truly successful (precision 0.3305), i.e. it includes many false positives.

Assuming you were a doctor, would these propositions be correct:

  • Intuition for high precision, low recall: most treatments predicted as successful really are successful, at the risk of missing many true successes (false negatives)?
  • Intuition for low precision, high recall: most of the truly successful treatments are captured, at the risk of including many false positives (i.e. a fair chance that a predicted successful outcome turns out to be unsuccessful in truth)?

It would be great to get a bit of feedback on, or corrections to, my interpretations.
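(To make the trade-off behind the two intuitions above concrete, here is a minimal sketch that sweeps the decision threshold on the predicted probabilities; it reuses the fitted LogR from the question, and the threshold values are arbitrary illustrations.)

from sklearn.metrics import precision_score, recall_score

proba = LogR.predict_proba(x_test)[:, 1]  # predicted P(success) per report
for t in (0.3, 0.5, 0.7):
    pred = (proba >= t).astype(int)
    print(t, precision_score(y_test, pred, zero_division=0),
          recall_score(y_test, pred, zero_division=0))
# Lowering the threshold raises recall (fewer missed successes) at the
# cost of precision (more false positives); raising it does the opposite.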

Best Answer

The logistic model's job is direct probability estimation. Don't use any accuracy measure that requires categorizing the estimated probabilities. The $c$-index (concordance probability; AUROC) can help, but it supplements rather than replaces measures based on the predicted probabilities and the log-likelihood. It would also be advisable to read about optimal decision making, which couples the predicted probabilities with a utility function and does not pre-categorize them; this amounts to minimizing expected loss/cost. The ROC curve doesn't help either, as it invites the analyst to choose a cutpoint that is divorced from the actual utility function.
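(As a hedged illustration of this advice, the sketch below evaluates the predicted probabilities directly via log-loss, Brier score, and the c-index, and derives a cutoff, if one is needed at all, from an assumed cost ratio rather than the default 0.5; the 9:1 cost ratio is purely hypothetical.)

from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score

proba = LogR.predict_proba(x_test)[:, 1]

print("log-loss:", log_loss(y_test, proba))       # log-likelihood based
print("Brier:", brier_score_loss(y_test, proba))  # squared error of the probabilities
print("c-index:", roc_auc_score(y_test, proba))   # concordance probability / AUROC

# If missing a successful treatment were judged, say, 9 times as costly
# as a false alarm (a hypothetical utility), the expected-loss-minimizing
# rule would act whenever P(success) exceeds 1 / (1 + 9) = 0.1.
cost_ratio = 9.0
act = proba > 1.0 / (1.0 + cost_ratio)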
