Scoring Rules – Comparing Logarithmic Loss, Brier Score, and AUC Score

Tags: auc, log-loss, scoring-rules

I have a dataset whose elements belong to one of two classes. I also have two methods, each of which assigns to every element a (complementary) pair of probabilities of belonging to either class.

Given that I work with probabilities (instead of hard 0/1 classification values), I was pointed to scoring rules as a way to assess which method performs better. The two most used rules appear to be:

- Logarithmic loss (log loss)
- Brier score

with log loss apparently being the standard approach (is it?). I also found scikit-learn's
roc_auc_score, an implementation of the area under the ROC curve (AUC), which appears to do pretty much the same thing.

My question is: is any one of these inherently "better" than the others in some way? I could also just use all three. Is this advisable?
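For concreteness, here is a minimal sketch of how I would compute all three with scikit-learn (the labels and probabilities below are made up, and the method names are only illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score

# True binary labels and each method's predicted probability of class 1
y_true     = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_method_a = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.3, 0.7, 0.2])
p_method_b = np.array([0.3, 0.2, 0.6, 0.7, 0.9, 0.1, 0.8, 0.4])

for name, p in [("method A", p_method_a), ("method B", p_method_b)]:
    print(name,
          "log loss:", log_loss(y_true, p),
          "Brier:",    brier_score_loss(y_true, p),
          "AUC:",      roc_auc_score(y_true, p))
```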

Best Answer

The choice depends on how you plan to use the model. There are many strictly proper scoring rules (AUC isn't one). They put different weights on different parts of the probability scale while all sharing the defining property that their expected value is optimized by reporting the true probabilities.
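To see that property concretely, here is a small sketch (with an assumed true probability of 0.7) showing that the expected log loss and the expected Brier score are both minimized by reporting the true probability, even though they penalize miscalibration differently:

```python
import numpy as np

p_true = 0.7                       # assumed true probability of class 1
q = np.linspace(0.01, 0.99, 99)    # candidate reported probabilities

# Expected scores when the true label is Bernoulli(p_true)
exp_log_loss = -(p_true * np.log(q) + (1 - p_true) * np.log(1 - q))
exp_brier    = p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2

print("log loss minimized at q =", q[np.argmin(exp_log_loss)])  # ~0.70
print("Brier    minimized at q =", q[np.argmin(exp_brier)])     # ~0.70
```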

I have found the report "Loss Functions for Binary Class Probability Estimation and Classification: Structure and Applications," by Andreas Buja, Werner Stuetzle, and Yi Shen, to be very helpful in thinking about this. The authors show that choice of probability cutoff is equivalent to a choice of the relative cost of false-positive and false-negative classifications. They then provide a way to tailor loss functions to meet different choices of relative costs.
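As a rough numeric illustration of that equivalence (the costs below are made up): with a false-positive cost c_FP and a false-negative cost c_FN, expected cost is minimized by classifying an element as positive whenever its predicted probability exceeds c_FP / (c_FP + c_FN), so a cutoff of 0.5 implicitly assumes equal costs.

```python
import numpy as np

c_fp, c_fn = 1.0, 4.0              # hypothetical misclassification costs
p = np.linspace(0.01, 0.99, 99)    # predicted probability of the positive class

# Expected cost of each decision at probability p
cost_if_negative = p * c_fn        # risk of missing a true positive
cost_if_positive = (1 - p) * c_fp  # risk of a false alarm

crossover = p[np.argmin(np.abs(cost_if_positive - cost_if_negative))]
print("numerical crossover ≈", crossover)              # ≈ 0.20
print("analytic cutoff     =", c_fp / (c_fp + c_fn))   # 0.20
```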

So the choice of scoring rule might best take the eventual use of the model into account. For a bit more detail without going into that full 48-page report, see related answers here and here.