Machine Learning – Proper Scoring Rule for Decision Making (e.g., Spam vs Ham Email)

accuracy, classification, machine-learning, model-evaluation, scoring-rules

Among others on here, Frank Harrell is adamant about using proper scoring rules to assess classifiers. This makes sense. If we have 500 $0$s with $P(1)\in[0.45, 0.49]$ and 500 $1$s with $P(1)\in[0.51, 0.55]$, we can get a perfect classifier by setting our threshold at $0.50$. However, is that really a better classifier than one that gives the $0$s all $P(1)\in[0.05, 0.07]$ and the $1$s all $P(1)\in[0.93,0.95]$, except for a single $1$ that gets $P(1)=0.04$?

Brier score says that the second classifier crushes the first, even though the second cannot achieve perfect accuracy.

set.seed(2020)
N <- 500
spam_1 <- runif(N, 0.45, 0.49) # category 0
ham_1 <- runif(N, 0.51, 0.55) # category 1
brier_score_1 <- sum((spam_1)^2) + sum((ham_1-1)^2) # summed squared error over all 1000 emails (not the mean)
spam_2 <- runif(N, 0.05, 0.07) # category 0 
ham_2 <- c(0.04, runif(N-1, 0.93, 0.95)) # category 1
brier_score_2 <- sum((spam_2)^2) + sum((ham_2-1)^2)
brier_score_1 # turns out to be 221.3765
brier_score_2 # turns out to be 4.550592
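
To make the comparison in terms of hard classifications explicit, here is a quick check of the accuracy each model achieves at a $0.50$ threshold (the labels and preds_* names exist only for this snippet), using the vectors defined above:

labels  <- c(rep(0, N), rep(1, N))   # 0 = spam, 1 = ham
preds_1 <- c(spam_1, ham_1)
preds_2 <- c(spam_2, ham_2)
mean((preds_1 > 0.5) == labels) # 1.000: the first model classifies every email correctly
mean((preds_2 > 0.5) == labels) # 0.999: the second model misfiles the single ham with P(1) = 0.04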

However, if we go with the second classifier, we end up calling a "ham" email "spam" and sending it to the spam folder. Depending on the email content, that could be quite bad news. With the first classifier, if we use a threshold of $0.50$, we always classify the spam as spam and the ham as ham. The second classifier has no threshold that can give the perfect classification accuracy that would be so wonderful for email filtering.

I concede that I don't know the inner workings of a spam filter, but I suspect there's a hard decision made to send an email to the spam folder or let it through to the inbox.$^{\dagger}$ Even if this is not how the particular example of email filtering works, there are situations where decisions have to be made.

As the user of a classifier who has to make a decision, what is the benefit of using a proper scoring rule as opposed to finding the optimal threshold and then assessing the performance when we classify according to that threshold? Sure, we may value sensitivity or specificity instead of just accuracy, but we don't get any of those from a proper scoring rule. I can imagine the following conversation with a manager.

Me: "So I propose that we use the second model, because of its much lower Brier score."

Boss: "So you want to go with the model that [goofs] up more often? SECURITY!"

I can see an argument that the model with the lower Brier score (good) but lower accuracy (bad) might be expected to perform better, in terms of classification accuracy, in the long run, and that it should not be penalized so harshly for one fluke point that the other model happens to get right despite its generally worse performance. But that still feels like an unsatisfying answer to give a manager if we are doing out-of-sample testing and seeing how these models perform on data they were not exposed to during training.

$^{\dagger}$An alternative would be some kind of dice roll based on the probability determined by the classifier. Say we get $P(spam)=0.23$. Then draw an observation $X$ from $\text{Bernoulli}(0.23)$ and send the email to the spam folder iff $X=1$. At some point, however, a decision is made about where to send the email; there is no "23% send it to the spam folder, 77% let it through to the inbox".
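
In R, that randomized rule would amount to something like the following (just a sketch of the idea):

p_spam <- 0.23
send_to_spam_folder <- rbinom(1, 1, p_spam) == 1  # TRUE with probability 0.23, FALSE otherwise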

Best Answer

I guess I'm one of the "among others", so I'll chime in.

The short version: I'm afraid your example is a bit of a straw man, and I don't think we can learn a lot from it.

In the first case, yes, you can threshold your predictions at 0.50 to get a perfect classification. True. But we also see that your model is actually rather poor. Take item #127 in the spam group and compare it to item #484 in the ham group. Their predicted probabilities $P(1)$ are roughly 0.49 and 0.51. (That's because I picked the largest prediction in the spam group and the smallest prediction in the ham group.)

That is, for the model they are almost indistinguishable in terms of their likelihood of being spam. But they aren't! We know that the first one is practically certain to be spam, and the second one to be ham. "Practically certain" as in "we observed 1000 instances, and the cutoff always worked". Saying that the two instances are practically equally likely to be spam is a clear indication that our model doesn't really know what it is doing.

Thus, in the present case, the conversation should not be whether we should go with model 1 or with model 2, or whether we should decide between the two models based on accuracy or on the Brier score. Rather, we should feed both models' predictions into any standard third model, such as an ordinary logistic regression. This will transform the predictions from model 1 into extremely confident predictions that are essentially 0 and 1 and thus reflect the structure in the data much better. The Brier score of this meta-model will be much lower, on the order of zero. In the same way, the predictions from model 2 will be transformed into predictions that are almost as good, but a little worse, with a Brier score that is somewhat higher. Now, the Brier scores of the two meta-models will correctly reflect that the one based on model 1 should be preferred.
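
To make this concrete, here is a rough sketch of the recalibration, assuming the objects from the question's code (N, spam_1, ham_1, spam_2, ham_2) are still in the workspace; the labels, preds_*, meta_* and recal_* names are just for this sketch:

labels  <- c(rep(0, N), rep(1, N))   # 0 = spam, 1 = ham
preds_1 <- c(spam_1, ham_1)
preds_2 <- c(spam_2, ham_2)

# Feed each model's predictions to an ordinary logistic regression.
# (Model 1 separates the classes perfectly, so glm() will warn about fitted
# probabilities numerically 0 or 1 -- which is exactly the point here.)
meta_1 <- glm(labels ~ preds_1, family = binomial)
meta_2 <- glm(labels ~ preds_2, family = binomial)

recal_1 <- predict(meta_1, type = "response")
recal_2 <- predict(meta_2, type = "response")

mean((recal_1 - labels)^2)  # Brier score of recalibrated model 1: essentially zero
mean((recal_2 - labels)^2)  # recalibrated model 2: somewhat higher (the 0.04 ham stays misjudged)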


And of course, the final decision will likely need to use some kind of threshold. Depending on the costs of type I and type II errors, the cost-optimal threshold might well be different from 0.5 (except, of course, in the present example). After all, as you write, it may be much more costly to misclassify ham as spam than vice versa. But as I write elsewhere, a cost-optimal decision might well involve more than one threshold! Quite possibly, a very low predicted spam probability would have the mail sent directly to your inbox, while a very high predicted probability would have it filtered at the mail server without you ever seeing it; probabilities in between might mean that a [SUSPECTED SPAM] tag is inserted in the subject line and the mail is still delivered to your inbox. Accuracy as an evaluation measure fails here, unless we start looking at separate accuracies for the multiple buckets; and in the end, all the "in between" mails will be classified as one or the other, so shouldn't they have been sent to the correct bucket in the first place? Proper scoring rules, on the other hand, can help you calibrate your probabilistic predictions.
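
For illustration, such a multi-threshold rule could look something like this; the thresholds of 0.2 and 0.8 and the function name route_mail are purely illustrative and would in practice come out of the cost analysis:

route_mail <- function(p_spam, lower = 0.2, upper = 0.8) {
    ifelse(p_spam < lower, "inbox",
           ifelse(p_spam > upper, "filter at server", "tag as [SUSPECTED SPAM]"))
}
route_mail(c(0.05, 0.50, 0.97))
# "inbox"  "tag as [SUSPECTED SPAM]"  "filter at server"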


To be honest, I don't think deterministic examples like the one you give here are very useful. If we knew what was happening, we wouldn't be doing probabilistic classification/prediction in the first place. So I would argue for probabilistic examples. Here is one: I'll generate 1,000 true underlying probabilities, uniformly distributed on $[0,1]$, then generate actuals according to these probabilities. Now we don't have the perfect separation that, I argue, fogs up the example above.

set.seed(2020)
nn <- 1000
true_probabilities <- runif(nn)            # true P(class = TRUE) for each instance
actuals <- runif(nn) < true_probabilities  # simulate the observed outcomes (logical)

library(beanplot)
beanplot(true_probabilities~actuals, 
    horizontal=TRUE,what=c(0,1,0,0),border=NA,col="lightgray",las=1,
    xlab="True probability")
points(true_probabilities,actuals+1+runif(nn,-0.3,0.3),pch=19,cex=0.6)

[Figure: beanplot of the true probabilities, split by actual outcome, with the individual (jittered) observations overlaid]

Now, if we had the true probabilities, we could use cost-based thresholds as above. But typically we will not know these true probabilities; instead, we may need to decide between competing models that each output such probabilities. I would argue that searching for a model that gets as close as possible to the true probabilities is worthwhile, because if we have a biased understanding of the true probabilities, then any resources we invest in changing the process (e.g., in medical applications: screening, inoculation, promoting lifestyle changes, ...) or in understanding it better may be misallocated. Put differently: working with accuracy and a threshold $t$ means that we don't care at all whether we predict a probability $\hat{p}_1$ or $\hat{p}_2$ as long as both are above the threshold, $\hat{p}_i>t$ (and vice versa below $t$), so we have zero incentive to understand and investigate which instances we are unsure about, just as long as we get them to the correct side of the threshold.

Let's look at a couple of miscalibrated predicted probabilities. Specifically, for the true probabilities $p$, we can look at power transforms $\hat{p}_x:=p^x$ for some exponent $x>0$. This is a monotone transformation, so any thresholds we would like to use based on $p$ can also be transformed for use with $\hat{p}_x$. Or, starting from $\hat{p}_x$ and not knowing $p$, we can optimize thresholds $\hat{t}_x$ to get the exact same accuracies for $(\hat{p}_x,\hat{t}_x)$ as for $(\hat{p}_y,\hat{t}_y)$, because of the monotonicity. This means that accuracy is of no use whatsoever in our search for the true probabilities, which correspond to $x=1$! However (drum roll), proper scoring rules like the Brier or the log score will indeed be optimized in expectation by the correct $x=1$.

# Brier score: mean squared difference between predicted probability and the 0/1 outcome
brier_score <- function(probs,actuals) mean(c((1-probs)[actuals]^2,probs[!actuals]^2))
# Log score: mean negative log-likelihood of the observed outcome
log_score <- function(probs,actuals) mean(c(-log(probs[actuals]),-log((1-probs)[!actuals])))

exponents <- 10^seq(-1,1,by=0.1)
brier_scores <- log_scores <- rep(NA,length(exponents))
for ( ii in seq_along(exponents) ) {
    brier_scores[ii] <- brier_score(true_probabilities^exponents[ii],actuals)
    log_scores[ii] <- log_score(true_probabilities^exponents[ii],actuals)
}
plot(exponents,brier_scores,log="x",type="o",xlab="Exponent",main="Brier score",ylab="")
plot(exponents,log_scores,log="x",type="o",xlab="Exponent",main="Log score",ylab="")

[Figure: Brier score and log score as a function of the exponent (log scale); both are minimized at or near the correct exponent $x=1$]
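
And as a quick cross-check of the claim that accuracy cannot distinguish between the exponents, we can transform the 0.5 threshold along with the probabilities (a small sketch, reusing the objects from the code above):

# p^x > 0.5^x if and only if p > 0.5, so the classifications are identical for every exponent
accuracies <- sapply(exponents, function(x) mean((true_probabilities^x > 0.5^x) == actuals))
accuracies  # one and the same value for all exponents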
