How to determine if the predicted probabilities from sklearn logistic regression are accurate

I am totally new to machine learning and I'm trying to use scikit-learn to make a simple logistic regression model with 1 input variable (X) and a binary outcome (Y). My data consists of 325 samples, with 39 successes and 286 failures. The data was split into a training and test (30%) set.

My goal is actually to obtain the predicted probabilities of success for any given X based on my data, not for classification prediction per se. That is, I will be taking the predicted probabilities for use in a separate model I'm building and won't be using the logistic regression as a classifier at all. So it's important that the predicted probabilities actually fit the data.

However, I am having some trouble understanding whether or not my model is a good fit to the data, or if the computed probabilities are actually accurate.

I am getting the following metrics:

  • Classification accuracy: metrics.accuracy_score(Y_test, predicted) = 0.92.
    My understanding of this metric is that the model has a high chance of making correct predictions, so it looks to me like the model is a good fit.

  • Log loss: cross_val_score(LogisticRegression(), X, Y, scoring='neg_log_loss', cv=10) = -0.26
    This is probably the most confusing metric for me, and apparently the most important as it is the accuracy of the predicted probabilities. I know that the closer to zero the score is the better – but how close is close enough?

  • AUC: metrics.roc_auc_score(Y_test, probs[:, 1]) = 0.9. Again, this looks good, since the closer the ROC score is to 1 the better.

  • Confusion Matrix: metrics.confusion_matrix(Y_test, predicted) =

            [[88,  0],
             [ 8,  2]]

    My understanding here is that the diagonal gives the numbers of correct predictions in the test set, so this looks ok.

  • Report: metrics.classification_report(Y_test, predicted) =

                 precision    recall  f1-score   support

            0.0       0.92      1.00      0.96        88
            1.0       1.00      0.20      0.33        10

    avg / total       0.93      0.92      0.89        98

    According to this classification report, the model has good precision so it is a good fit.
    I am not sure how to interpret the recall or whether this report is bad news for my model. The sklearn documentation states that recall is a model's ability to find all positive samples – so a score of 0.2 for a prediction of 1 would mean that it only finds the positives 20% of the time? That sounds like a really bad fit to the data.

I'd really appreciate it if someone could confirm whether I am interpreting these metrics the right way – and perhaps shed some light on whether my model is good or bogus. Also, if there are any other tests I could do to determine if the computed probabilities are accurate, please let me know.

If these aren't good metric scores, I'd really appreciate some direction on where to go next in terms of improvement.


Best Answer

Sklearn provides calibration curves (sklearn.calibration.calibration_curve) and the Brier score (sklearn.metrics.brier_score_loss). Both of these address the issue of probability accuracy. A calibration curve is a plot where one axis is the predicted probability and the other is the observed frequency of positives. To compute it, group the test cases into bins by predicted probability. Suppose one bin holds 100 cases, Q=40 of which are truly positive, so the actual probability of a case in that bin being positive is 0.4. Ideally, if we averaged the model's probability estimates for these 100 cases we would also get 0.4. Repeat this for each bin and a well-calibrated model gives points along the diagonal. The Brier score is the mean squared error between the predicted probabilities and the actual 0/1 outcomes, so lower is better.
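As a rough sketch of how both checks might look in practice, the snippet below fits a logistic regression on synthetic data and prints a binned calibration table plus the Brier score. The data, the bin count, and the base-rate reference are my assumptions, not your setup; swap in your own X and Y.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): one feature, rare positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
true_p = 1.0 / (1.0 + np.exp(-(-2.0 + 1.5 * X[:, 0])))
y = (rng.random(2000) < true_p).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted P(Y = 1)

# Bin the test cases by predicted probability. For each bin, compare the
# observed fraction of positives with the mean predicted probability.
frac_pos, mean_pred = calibration_curve(
    y_test, probs, n_bins=5, strategy="quantile"
)
for observed, predicted in zip(frac_pos, mean_pred):
    print(f"mean predicted {predicted:.2f}   observed fraction {observed:.2f}")

# Brier score: mean squared difference between predicted probability
# and the actual 0/1 outcome (lower is better).
brier = brier_score_loss(y_test, probs)

# Reference point: always predict the training base rate.
baseline = brier_score_loss(y_test, np.full(len(y_test), y_train.mean()))
print(f"Brier (model) = {brier:.3f}   Brier (base rate only) = {baseline:.3f}")
```

With only 39 positives in your data, each bin will contain very few successes, so expect a noisy curve. Fewer bins with quantile spacing is one way to keep the bins populated.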

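You are correct that log loss is difficult to interpret. It is also related: like the Brier score, it measures the accuracy of the predicted probabilities, and lower is better.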
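One way to give a log loss value some scale is to compare it with the log loss of a model that ignores X and predicts the base rate for every sample. The sketch below uses the counts from the question (39 successes out of 325) and comes out at roughly 0.37. It is an in-sample figure, so treat it as an approximate reference for the cross-validated 0.26 rather than an exact one.

```python
import math

n_success, n_total = 39, 325       # counts taken from the question
base_rate = n_success / n_total    # 0.12

# Log loss of a model that predicts the base rate for every sample.
baseline_log_loss = -(
    base_rate * math.log(base_rate)
    + (1 - base_rate) * math.log(1 - base_rate)
)
print(f"baseline log loss = {baseline_log_loss:.3f}")

# Magnitude of the cross-validated neg_log_loss reported in the question.
reported_log_loss = 0.26

# Relative reduction in log loss compared with the base-rate model.
improvement = 1 - reported_log_loss / baseline_log_loss
print(f"relative improvement = {improvement:.2f}")
```

A log loss clearly below the base-rate value suggests the model's probabilities carry information beyond the class frequencies. It does not by itself show that they are well calibrated.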

Precision and recall are not so useful for this. They depend on the classification threshold and trade off against each other as it moves.
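To illustrate the threshold dependence, here is a small sketch with made-up labels and probabilities (purely hypothetical values, not your data). The helper precision_recall is my own, not an sklearn function. The recall of 0.20 in your report comes from predict(), which for a binary logistic regression amounts to a 0.5 cutoff on the predicted probability.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall when predicting 1 for probabilities >= threshold."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    pairs = list(zip(y_true, predicted))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.70]

# The same probabilities give different precision/recall at each cutoff.
for threshold in (0.5, 0.25, 0.1):
    precision, recall = precision_recall(y_true, probs, threshold)
    print(f"threshold {threshold:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

The probabilities never change here, only the cutoff does, which is why these two numbers say little about whether the probabilities themselves are accurate.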