Scoring Rules – Understanding the Brier Loss Decomposition

Tags: calibration, scoring-rules

I built several models and measured the Brier loss, calibration loss, and refinement loss for both the raw (uncalibrated) model and a calibrated one. Now I am trying to interpret the results, but I cannot make sense of them in combination with the calibration plots.

My understanding is that the smaller the calibration loss, the better the calibration. But how can I interpret the refinement loss? According to Wikipedia, "The second term is known as refinement. It is an aggregation of resolution and uncertainty and is related to the area under the ROC Curve." But the ROC AUC does not change for any of the models, whereas the refinement loss is vastly different for "Gradient" and "XGB."
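
For reference, this is the three-term (Murphy) decomposition of the Brier score I am referring to, written out for a binary outcome with the forecasts grouped into $K$ bins (the notation is mine):

$$
\mathrm{BS}
= \underbrace{\sum_{k=1}^{K} \frac{n_k}{N}\,\bigl(f_k - \bar{o}_k\bigr)^2}_{\text{calibration (reliability)}}
\;-\; \underbrace{\sum_{k=1}^{K} \frac{n_k}{N}\,\bigl(\bar{o}_k - \bar{o}\bigr)^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\,(1-\bar{o})}_{\text{uncertainty}},
\qquad
\text{refinement} = \text{uncertainty} - \text{resolution},
$$

where $f_k$ is the forecast value of bin $k$, $\bar{o}_k$ the observed event frequency in that bin, $n_k$ the number of samples in the bin, and $\bar{o}$ the overall base rate.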

I also cannot connect the calibration plots to the actual calibration loss. To me, the uncalibrated plots look much better calibrated, yet the calibration loss is lower for the calibrated models. I assume the missing part of the uncalibrated plot may have something to do with it.

I would be grateful for any hints, as well as recommendations for literature.

Calibrated
               Brier Loss       Calibration Loss  Refinement Loss
Classifier                                                
Dummy Strat    0.162253          0.000034         0.162218
Gradient       0.149662          0.047544         0.102118
XGB            0.150804          0.085912         0.064892

Uncalibrated
               Brier Loss       Calibration Loss  Refinement Loss
Classifier                                                
Dummy Strat    0.326294          0.164076         0.162218
Gradient       0.149962          0.095328         0.054634
XGB            0.151905          0.128606         0.023299

The first picture shows the calibration plot (Left: Uncalibrated / Right: Calibrated)


The second picture shows the distribution plot (Left: Uncalibrated / Right: Calibrated)


Best Answer

I was able to answer almost all of my questions:

The uncertainty in this example is about 0.16. Since the DummyClassifier's refinement loss is also about 0.16, its resolution must be around 0, which makes sense because it only predicts a single value. For the other classifiers, the resolution shrinks after calibration because calibration pulls the predicted probabilities toward the average, which in turn increases their refinement loss.
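
As a quick sanity check (the base rate of roughly 0.20 is inferred from the uncertainty value; it is not given explicitly):

$$
\text{uncertainty} = \bar{o}(1-\bar{o}) \approx 0.20 \times 0.80 \approx 0.162,
\qquad
\text{refinement}_{\text{Dummy}} = \underbrace{0.162}_{\text{uncertainty}} - \underbrace{0}_{\text{resolution}} = 0.162,
$$

which matches the DummyClassifier's refinement loss in both tables.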

Uncertainty: how unbalanced is the outcome? It is 0.25 (the maximum) for a perfectly balanced binary outcome and 0 if there is only one outcome.

Resolution: how extreme are the probabilities? It is 0 if the predicted probabilities all equal the average (as for the DummyClassifier), and it equals the uncertainty if the model outputs only 0/1 predictions that perfectly separate the classes.
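
Here is a quick numerical check of the two claims above. This is a minimal sketch; the binning scheme and the `murphy_decomposition` helper are my own, not a library function:

```python
import numpy as np

def murphy_decomposition(y_true, y_prob, n_bins=10):
    """Split the Brier score into reliability (calibration),
    resolution, and uncertainty by binning the forecasts."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    base_rate = y_true.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in np.unique(bins):
        mask = bins == b
        weight = mask.mean()            # fraction of samples in this bin
        obs_freq = y_true[mask].mean()  # observed event rate in this bin
        mean_pred = y_prob[mask].mean() # average forecast in this bin
        reliability += weight * (mean_pred - obs_freq) ** 2
        resolution += weight * (obs_freq - base_rate) ** 2
    # Brier ~ reliability - resolution + uncertainty,
    # refinement = uncertainty - resolution
    return reliability, resolution, uncertainty

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=10_000)   # imbalanced outcome, base rate ~0.2

# Constant forecast (DummyClassifier-like): resolution ~ 0
_, res, unc = murphy_decomposition(y, np.full_like(y, y.mean(), dtype=float))
print(f"constant forecast:    resolution={res:.4f}, uncertainty={unc:.4f}")

# Perfect 0/1 forecast: resolution equals the uncertainty
_, res, unc = murphy_decomposition(y, y.astype(float))
print(f"perfect 0/1 forecast: resolution={res:.4f}, uncertainty={unc:.4f}")
```

The constant forecast ends up in a single bin whose observed frequency equals the base rate, so its resolution is 0 and its refinement loss equals the uncertainty, exactly as in the DummyClassifier rows above.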

The mismatch with the calibration plot turned out to be a display issue in how the graph is created and rendered.
