Model Evaluation – How to Compute the Brier Score for More Than Two Classes

classificationmodel-evaluationscikit learnscoring-rules

tl;dr

How do I correctly compute the Brier score for more than two classes? I got confusing results with different approaches. Details below.


As suggested to me in a comment to this question, I would like to evaluate the quality of a set of classifiers I trained with the Brier score. These classifiers are multiclass classifiers and the classes are imbalanced. The Brier score should be able to handle these conditions. However, I am not quite confident about how to apply the Brier score test. Say I have 10 data points and 5 classes:

One hot vectors represent which class is present in a given item of data:

targets = array([[0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 1],
                 [0, 0, 0, 0, 1],
                 [0, 1, 0, 0, 0],
                 [0, 0, 0, 0, 1],
                 [0, 0, 1, 0, 0],
                 [1, 0, 0, 0, 0],
                 [0, 1, 0, 0, 0],
                 [1, 0, 0, 0, 0],
                 [1, 0, 0, 0, 0]])

Vectors of probabilities represent the outputs of my classifiers, assigning a probability to each class

probs = array([[0.14, 0.38, 0.4 , 0.04, 0.05],
               [0.55, 0.05, 0.34, 0.04, 0.01],
               [0.3 , 0.35, 0.18, 0.09, 0.08],
               [0.23, 0.22, 0.04, 0.05, 0.46],
               [0.  , 0.15, 0.47, 0.28, 0.09],
               [0.23, 0.13, 0.34, 0.27, 0.03],
               [0.32, 0.06, 0.59, 0.02, 0.01],
               [0.01, 0.19, 0.01, 0.03, 0.75],
               [0.27, 0.38, 0.03, 0.12, 0.2 ],
               [0.17, 0.45, 0.11, 0.25, 0.01]])

These matrices are coindexed, so probs[i, j] is the probability of class targets[i, j].

Now, according to Wikipedia the definition of the Brier Score for multiple classes is

$$\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} (f_{ti} – o_{ti})^2$$

When I program this in Python and run it on the above targets and probs matrices, I get a result of $1.0069$

>>> def brier_multi(targets, probs):
...     return np.mean(np.sum((probs - targets)**2, axis=1))
... 
>>> brier_multi(targets, probs)
1.0068899999999998

But I am not sure if I interpreted the definition correctly.

For Python the sklearn library provides sklearn.metrics.brier_score_loss. While the documentation states

The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false

What the function actually does is pick one (or get one passed as an argument) of $n > 2$ classes and treat that class as class $1$ and all other classes as class $0$.

For example, if we choose class 3 (index 2) as the $1$ class and thus all other classes as class $0$, we get:

>>> # get true classes by argmax over binary arrays
... true_classes = np.argmax(targets, axis=1)
>>> 
>>> brier_score_loss(true_classes, probs[:,2], pos_label=2)
0.13272999999999996

alternatively:

>>> brier_score_loss(targets[:,2], probs[:,2])
0.13272999999999996

This is indeed the binary version of the Brier score, as can be shown by manually defining and running it:

>>> def brier_bin_(targets, probs):
...     return np.mean((targets - probs) ** 2)
>>> brier_bin(targets[:,2], probs[:,2])
0.13272999999999996

As you can see, this is the same result as with sklearn's brier_score_loss.

Wikipedia states about the binary version:

This formulation is mostly used for binary events (for example "rain"
or "no rain"). The above equation is a proper scoring rule only for
binary events;

So… Now I am confused and have the following questions:

1) If sklearn computes the multi class Brier score as a One vs. All binary score, is that the only and correct way to compute the multi class Brier score?

Which leads me to

2) If that is so, my brier_multi code must be based on a misconception. What is my misconception about the definition of the multiclass Brier score?

3) Maybe I am on the wrong track altogether. In which case, please explain to me, how I compute the Brier score correctly?

Best Answer

Wikipedia's version of the Brier score for multiple categories is correct. Compare the original publication by Brier (1950), or any number of academic publications, e.g. Czado et al. (2009) (equation (6), though you would need to do some simple arithmetic and drop a constant 1 to arrive at Brier's formulation).

  1. If sklearn calculates a binary "one against all" Brier score and averages over all choices of a focal class, then it can certainly do so. However, it is simply not the Brier score. Passing it off as such is misleading and wrong.

  2. The misconception lies entirely with sklearn.

  3. Just use your brier_multi, it's completely correct.