What does it mean that AUC is a semi-proper scoring rule?

classification, measurement error, references, roc, scoring-rules

A proper scoring rule is a rule that is maximized by a 'true' model and does not allow 'hedging' or gaming the system (deliberately reporting results that differ from the model's true belief in order to improve the score). The Brier score is proper; accuracy (the proportion classified correctly) is improper and often discouraged. Sometimes I see AUC called a semi-proper scoring rule, which makes it not as completely bogus as accuracy but less sensitive than proper rules (for example here: https://stats.stackexchange.com/a/90705/53084).

What does semi-proper scoring rule mean? Is it defined somewhere?

Best Answer

Let's start with an example. Say Alice is a track coach and wants to pick an athlete to represent the team in an upcoming sporting event, a 200m sprint. Naturally she wants to pick the fastest runner.

  • A strictly proper scoring rule would be to nominate the fastest runner of the team over the 200m distance. This maximizes exactly what coach Alice wants in this situation. The athlete with the fastest expected performance gets selected - this is a fair discriminatory test.
  • A proper scoring rule would be to pick the athlete who is able to run 200m the fastest, but with the time rounded to the nearest half second. The best athlete, as well as potentially some other athletes, will be able to pass this test. All athletes selected this way are quite competitive, but clearly this is not a perfect discriminatory test of speed.
  • A semi-proper scoring rule would be to pick an athlete who is able to run 200m below a competitive time threshold, e.g. 22 seconds. As before, the best athlete, as well as some other athletes, will be able to pass this test. Similarly, all athletes selected this way might be quite competitive, but not only is this not a perfect discriminatory test, it can also go horribly wrong (if we pick a threshold that is too lenient or too stringent). Note that it is not outright wrong.
  • An improper scoring rule would be to pick the athlete with the strongest legs, e.g. the one who can squat the most weight. Certainly, any good sprinter probably has very strong legs, but this test means that some people from the weight-lifting team will excel here. Clearly, a weight-lifter in a 200m race would be catastrophic!

While somewhat trivialised, the example above shows what takes place when scoring rules are used. Alice was forecasting expected sprint time. In the context of classification, we forecast probabilities, aiming to minimise the error of a probabilistic classifier.

  • A strictly proper scoring rule, like the Brier score, guarantees that the best score will only be attained when we are as close to the true probabilities as possible.
  • A proper scoring rule, like the continuous ranked probability score (CRPS), does not guarantee that the best score will only be attained by a classifier whose predictions are the closest to the true probabilities. Other candidate classifiers might attain CRPS scores that match that of the optimal classifier.
  • A semi-proper scoring rule, like the AUC-ROC, not only fails to guarantee that the best performance will be attained by a classifier whose predictions are the closest to the true probabilities, but also (potentially) allows the value of AUC-ROC to be improved by moving the predicted probabilities away from their true values. Nevertheless, under certain conditions (e.g. the class distribution is known a priori, in the case of AUC-ROC) such rules can approximate a proper scoring rule. Byrne (2016), "A note on the use of empirical AUC for evaluating probabilistic forecasts", raises some interesting points regarding AUC-ROC.
  • An improper scoring rule, like Accuracy, offers little to no connection to our original task of predicting probabilities as close as possible to the true probabilities.
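
To make the contrast in the list above concrete, here is a minimal simulation sketch (not part of the original answer). It assumes outcomes are drawn from known true probabilities and compares those true probabilities against a deliberately overconfident forecast. The `sharpen` transform, the sample size and the seed are illustrative assumptions; `brier`, `auc` and `accuracy` are small hand-written helpers, not library functions. Because the distortion is monotone and leaves 0.5 fixed, AUC and Accuracy should not be able to tell the two forecasts apart, whereas the Brier score should penalise the distorted one.

```python
import random

def brier(y, p):
    """Mean squared difference between forecasts and 0/1 outcomes (lower is better)."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def auc(y, p):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y, p, threshold=0.5):
    """Share of cases where thresholding the forecast recovers the outcome."""
    return sum((pi > threshold) == (yi == 1) for yi, pi in zip(y, p)) / len(y)

def sharpen(p, a=3.0):
    """Hypothetical distortion: pushes forecasts towards 0 or 1, keeps order and 0.5."""
    return p ** a / (p ** a + (1 - p) ** a)

rng = random.Random(0)                      # fixed seed, for reproducibility only
true_p = [rng.random() for _ in range(2000)]
y = [1 if rng.random() < pi else 0 for pi in true_p]
overconfident = [sharpen(pi) for pi in true_p]

for name, forecast in [("true", true_p), ("overconfident", overconfident)]:
    print(f"{name:14s} Brier={brier(y, forecast):.4f} "
          f"AUC={auc(y, forecast):.4f} accuracy={accuracy(y, forecast):.4f}")
```

The sketch only illustrates that AUC depends on the ranking of the forecasts and Accuracy on the side of the threshold they fall on, so neither rewards getting the probabilities themselves right.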

As we see, a semi-proper scoring rule is not perfect, but it is not outright catastrophic either. It can actually be quite useful during prediction! Cagdas Ozgenc has a great example here where working with an improper/semi-proper rule is preferable to a strictly proper rule. In general, the term semi-proper scoring rule is not very common. It is associated with improper rules that can nevertheless be helpful (e.g. AUC-ROC or MAE in probabilistic classification).
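
As a small illustration of why MAE is not proper for probability forecasts, the sketch below (an addition, under the assumption of a single binary event with a known true probability `p_true`) computes the expected absolute error and the expected Brier score over a grid of reported probabilities. The function names and the grid are hypothetical choices for this example. The expected absolute error is linear in the reported probability, so it is minimised by reporting an extreme value rather than the true probability.

```python
def expected_abs_error(p, q):
    """Expected |y - q| when y = 1 with probability p and q is the reported probability."""
    return p * abs(1 - q) + (1 - p) * abs(0 - q)

def expected_brier(p, q):
    """Expected (y - q)^2 under the same assumption."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p_true = 0.7                                  # assumed true event probability
grid = [i / 100 for i in range(101)]          # candidate reports 0.00, 0.01, ..., 1.00

best_mae = min(grid, key=lambda q: expected_abs_error(p_true, q))
best_brier = min(grid, key=lambda q: expected_brier(p_true, q))

print("report minimising expected absolute error:", best_mae)
print("report minimising expected Brier score:   ", best_brier)
```

Under these assumptions the absolute-error optimum sits at the boundary of the grid, while the Brier optimum coincides with `p_true`.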

Finally, notice something important. As sprinting is associated with strong legs, so is correct probabilistic classification with Accuracy. It is unlikely that a good sprinter will have weak legs, and similarly it is unlikely that a good classifier will have bad Accuracy. Nevertheless, equating Accuracy with good classifier performance is like equating leg strength with good sprinting performance: not completely unfounded, but quite likely to lead to nonsensical results.
