Why is Absolute Loss Not a Proper Scoring Rule?

classification, loss-functions, machine-learning, scoring-rules, supervised-learning

The Brier score is a proper scoring rule and is, at least in the binary classification case, just square loss.

$$\text{Brier}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^N\big(y_i -\hat{y}_i\big)^2$$

Apparently this can be adjusted for settings where there are three or more classes.
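From what I've read, the usual multi-class generalization (the notation here is my own) sums the squared differences over one-hot class indicators:

$$\text{Brier}(y,\hat{y}) = \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \big(y_{ik} - \hat{y}_{ik}\big)^2$$

where $y_{ik}$ is $1$ if observation $i$ belongs to class $k$ (and $0$ otherwise), and $\hat{y}_{ik}$ is the predicted probability of class $k$.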

In another post on Cross Validated, it is mentioned that absolute loss is not a proper scoring rule.

$$
\text{absoluteLoss}(y,\hat{y}) = \frac{1}{N} \sum_{i=1}^N\big\vert y_i -\hat{y}_i\big\vert
$$

It seems similar enough to Brier score that it should be a proper scoring rule.

  1. Why is absolute loss not a proper scoring rule?

  2. Is absolute loss a proper scoring rule in the binary classification case, one that loses its "properness" only when there are more than two output categories?

  3. Can absolute loss be adjusted, as the Brier score can, to have a proper form when there are more than two classes?

At least in the binary case, absolute loss has an easier interpretation than the Brier score or its square root: it gives the average amount by which a predicted probability differs from the observed outcome. So I would like to have a way to make absolute loss proper.

Best Answer

Let's first make sure we agree on definitions. Consider a binary random variable $Y \sim \text{Ber}(p)$, and consider a loss function $L(y|s)$, where $s$ is an estimate of $p$ given the data. In your examples, $s$ is a function of the observed data $y_1,\dots,y_n$, with $s = \hat{p}$. The Brier loss function is $L_b(y|s) = |y - s|^2$, and the absolute loss function is $L_a(y|s) = |y - s|$. A loss function has an expected loss $R(p|s) := E_Y(L(Y|s))$. A loss function is a proper scoring rule if the expected loss $R(p|s)$ is minimized with respect to $s$ by setting $s=p$, for any $p\in(0,1)$.

A handy trick for verifying this is to use the binary nature of $Y$: for any expected loss, we have $$R(p|s) = pL(1|s) + (1-p)L(0|s)$$

Let's start by verifying that the Brier loss function is a proper scoring rule. Note that $L_b(1|s) = |1-s|^2 = (1-s)^2$ and $L_b(0|s) = s^2$, so using the above, we have $$R_b(p|s) = p(1-s)^2 + (1-p)s^2$$

and taking the derivative of that function with respect to $s$ and setting it to $0$ shows that the choice $s = p$ minimizes the expected loss. So the Brier score is indeed a proper scoring rule.
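Spelling out that derivative step: $$\frac{\partial}{\partial s} R_b(p|s) = -2p(1-s) + 2(1-p)s = 2(s-p),$$ which vanishes exactly at $s = p$; and since the second derivative is $2 > 0$, this is a minimum.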

In contrast, recalling the binary nature of $Y$, we can write the absolute loss $L_a$ as $$L_a(y|s) = y(1-s) + (1-y)s$$ as $y\in\{0,1\}$. As such, we have that $$R_a(p|s) = p(1-s) + (1-p)s = p + s - 2ps$$

Unfortunately, $R_a(p|s)$ is not minimized by $s=p$. Since $R_a(p|s) = p + (1-2p)s$ is linear in $s$, it is minimized by $s=1$ when $p>.5$, by $s=0$ when $p<.5$, and by any choice of $s$ when $p=.5$.
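If you want to see this concretely, here is a minimal numerical sketch (in Python with numpy; the function names are just for illustration) that evaluates both expected losses on a grid of $s$ for a fixed $p$:

```python
import numpy as np

def expected_brier(p, s):
    # R_b(p|s) = p*(1-s)^2 + (1-p)*s^2
    return p * (1 - s) ** 2 + (1 - p) * s ** 2

def expected_absolute(p, s):
    # R_a(p|s) = p*(1-s) + (1-p)*s
    return p * (1 - s) + (1 - p) * s

p = 0.7
s_grid = np.linspace(0, 1, 1001)

# The expected Brier loss is minimized at s = p ...
print(s_grid[np.argmin(expected_brier(p, s_grid))])     # 0.7
# ... while the expected absolute loss is minimized at s = 1,
# because p > .5 makes R_a(p|s) decreasing in s.
print(s_grid[np.argmin(expected_absolute(p, s_grid))])  # 1.0
```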

So to answer your questions: absolute loss is not a proper scoring rule, and that has nothing to do with the number of output categories. As for whether it can be wrestled into a proper form, I certainly can't think of a way; I suspect such attempts will just lead you back to the Brier score :).

Edit:

In response to OP's comment, note that the absolute loss approach is basically estimating the median of $Y$, which in the binary case is either $0$ or $1$, depending on whether $p$ is below or above $.5$. The absolute loss just doesn't penalize the alternative choice enough to make you want to choose anything but the value that shows up the most. In contrast, the squared error penalizes the alternative enough to find a middle ground that coincides with the mean $p$.

This should also highlight that there's nothing wrong with using absolute loss as a classifier; you can think of the choice as related to whether, for a given problem, you care more about the mean or the median. For binary data, I'd personally say the mean is more interesting (knowing the median only tells you whether $p > .5$, while knowing the mean tells you $p$ itself), but it depends. As the other post also emphasizes, there's nothing wrong with absolute loss, it just isn't a proper scoring rule.
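The mean-versus-median point is easy to check empirically; here is another small sketch (again Python/numpy, names are mine) that grid-searches the constant prediction minimizing each empirical loss on a Bernoulli sample:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=10_000)  # a Bernoulli(0.7) sample

# Grid-search the constant prediction s minimizing each empirical loss
s_grid = np.linspace(0, 1, 1001)
mse = [np.mean((y - s) ** 2) for s in s_grid]
mae = [np.mean(np.abs(y - s)) for s in s_grid]

print(s_grid[np.argmin(mse)])  # ~0.7: the sample mean
print(s_grid[np.argmin(mae)])  # 1.0: the sample median (majority class)
```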