F2 score or the Area under the Precision-Recall-Curve as a scoring metric

classification, cross-validation, machine-learning, precision-recall, scikit-learn

I have a dataset with which I want to perform binary classification.
The distribution of the target class is imbalanced: 20% positive labels, 80% negative labels.
The positive class is more important than the negative class: false negatives (FN) are more costly than false positives (FP), and I need to output probabilities for all predictions.

[Image taken from MachineLearningMastery.com.]

Should I use the F2 score or the Area under the Precision-Recall-Curve (PR AUC) as a scoring metric?
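For reference, here is a minimal sketch of how the F2 score could be computed from predicted probabilities with scikit-learn (the toy arrays and the 0.5 threshold are arbitrary placeholders for my data):

    import numpy as np
    from sklearn.metrics import fbeta_score

    # Toy stand-ins for the true labels and the predicted probabilities
    # of the positive class.
    y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
    y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.05, 0.9, 0.2, 0.7])

    # F-beta needs hard labels, so the probabilities have to be thresholded first
    # (the 0.5 cut-off is an arbitrary choice for illustration).
    y_pred = (y_prob >= 0.5).astype(int)

    # beta=2 weights recall more heavily than precision, matching
    # "false negatives are more costly than false positives".
    f2 = fbeta_score(y_true, y_pred, beta=2)
    print(f"F2 at threshold 0.5: {f2:.3f}")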

There are multiple ways to calculate the PR AUC with scikit-learn.

  1. Average precision:

         from sklearn.metrics import average_precision_score
         average_precision = average_precision_score(y_test, y_score)

  2. Trapezoidal area under the precision-recall curve:

         from sklearn.metrics import auc, precision_recall_curve
         precision, recall, thresholds = precision_recall_curve(y_test, y_score)
         auc_precision_recall = auc(recall, precision)

Which of these is the more exact or more commonly used way to calculate the PR AUC?
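For concreteness, here is a self-contained sketch on made-up synthetic data (an arbitrary logistic-regression model, only to produce some probabilities) that runs both calculations side by side:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import auc, average_precision_score, precision_recall_curve
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data, roughly 20% positives, purely for illustration.
    X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    y_score = model.predict_proba(X_test)[:, 1]  # probabilities of the positive class

    # Way 1: average precision (step-wise summary of the PR curve).
    ap = average_precision_score(y_test, y_score)

    # Way 2: trapezoidal area under the PR curve.
    precision, recall, _ = precision_recall_curve(y_test, y_score)
    pr_auc = auc(recall, precision)

    print(f"average_precision_score: {ap:.4f}")
    print(f"auc(recall, precision):  {pr_auc:.4f}")

The two numbers typically come out close but not identical, since average_precision_score summarizes the curve step-wise while auc(recall, precision) uses trapezoidal interpolation.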

Best Answer

There's nothing to keep you from calculating several metrics, so evaluate all metrics that are relevant for your application.

  • A model is rarely (if ever?) characterized well with a single metric.

    E.g., an acceptable-looking accuracy, G-mean, Brier score, or AuROC may be based on a highly sensitive but only moderately specific model, or on a moderately sensitive but highly specific model, which makes a great deal of difference for your application.

    Also, AuROC summarizes certain aspects of model performance across all possible working points/thresholds, including regions that are irrelevant for your application (because they are unacceptable).

  • You crucially need to account for the prevalence/incidence/prior probability of the classes, since that also drives the total number of false positives and false negatives you can expect. While they are related, the class imbalance observed in the training data and the expected prior probability under the application scenario may be very different in practice.

    Sensitivity and specificity are easy to measure even under class imbalance. But from an application perspective, the predictive values (the probability that a prediction is correct, given that it is positive [negative]) are what is needed, and they crucially depend on the prevalence/incidence/prior probability of the classes; the sketch after this list illustrates this.

  • Since you do predict probabilities, make sure you do any data-driven model/hyperparameter optimization with a proper scoring rule (e.g., the Brier score/MSE; its computation is included in the sketch below).

    (Side note: it is possible to derive metrics that behave continuously and have variance properties analogous to those of, e.g., the Brier score, but that track particular aspects of predictive performance such as sensitivity/specificity or the predictive values.)

  • It may be possible to construct a single metric that gives you an appropriate summary of fitness-for-purpose, but you'll need to set that up specifically for your application.
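A minimal sketch of how the above could be checked with scikit-learn (the labels/probabilities are simulated here, and the 0.5 threshold and 5% deployment prevalence are arbitrary placeholders to be replaced with your own numbers):

    import numpy as np
    from sklearn.metrics import brier_score_loss, confusion_matrix

    rng = np.random.default_rng(0)

    # Simulated held-out labels (~20% positives) and predicted probabilities,
    # only so the snippet runs on its own; replace with real model output.
    y_test = rng.binomial(1, 0.2, size=2000)
    y_score = np.clip(rng.normal(0.25 + 0.4 * y_test, 0.15), 0, 1)

    # Proper scoring rule evaluated on the probabilities themselves.
    brier = brier_score_loss(y_test, y_score)

    # Sensitivity and specificity at one (arbitrary) working point.
    threshold = 0.5
    y_pred = (y_score >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Predictive values at the prevalence expected in deployment (Bayes' rule).
    # The training data may contain 20% positives, but the application scenario
    # can be different; 5% here is a made-up example value.
    prevalence = 0.05
    ppv = sensitivity * prevalence / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = specificity * (1 - prevalence) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

    print(f"Brier score: {brier:.4f}")
    print(f"sensitivity: {sensitivity:.3f}, specificity: {specificity:.3f}")
    print(f"PPV at {prevalence:.0%} prevalence: {ppv:.3f}, NPV: {npv:.3f}")

The predictive values computed this way can shift substantially when the assumed prevalence differs from the class balance in the training data.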
