Solved – Statistical approach to compare the calibration between models

Tags: calibration, logistic, machine-learning, r, scoring-rules

This seems like a common problem but I cannot find a solution.

I have a set of binary observations and two different models, each with predictions for each observation. I want to compare the calibration of the models.

There are several approaches to comparing the discrimination of these models (e.g. roc.test in the pROC package in R), but no equivalent approach for comparing calibration. Most empirical papers just list the p-values from two separate calibration tests, each testing whether one model's calibration is off (e.g. Hosmer-Lemeshow, Brier score).

What I am looking for is a direct statistical comparison of the calibration between two models.

Here's an extreme test data set. The values of the Brier score, Spiegelhalter Z-test, etc. all suggest that p2 is better calibrated, and we know it is. Can anyone make this into a formal statistical test?

library("pROC")
y <- rbinom(100,1,1:100/100)
p1 <- 1:100/10001
p2 <- 1:100/101
val.prob(p1,y)
val.prob(p2,y)

Best Answer

As you know, the Brier score measures calibration and is the mean squared error, $\bar B = n^{-1} \sum_i (\hat y_i - y_i)^2$, between the predictions, $\hat y_i$, and the responses, $y_i$. Since the Brier score is a mean, comparing two Brier scores is basically a comparison of means, and you can go as fancy with that as you like. I'll suggest two approaches and point to a third:
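For concreteness, the two Brier scores themselves are just the means of the per-observation squared errors (using the y, p1 and p2 from your example):

# point estimates of the two Brier scores, i.e. the two mean squared errors
mean((p1 - y)^2)
mean((p2 - y)^2)

The question is then whether the difference between these two means is larger than chance alone would produce.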

One option: do a t-test

My immediate response when I hear "comparison of means" is to do a t-test. Squared errors probably aren't normally distributed in general, so it's possible that this isn't the most powerful test, but it seems fine in your extreme example. Below I test the alternative hypothesis that p1 has greater MSE than p2:

y <- rbinom(100,1,1:100/100)
p1 <- 1:100/10001
p2 <- 1:100/101

# per-observation squared errors for each model
squares_1 <- (p1 - y)^2
squares_2 <- (p2 - y)^2

# one-sided paired t-test: is model 1's squared error larger on average than model 2's?
t.test(squares_1, squares_2, paired=TRUE, alternative="greater")
#> 
#>  Paired t-test
#> 
#> data:  squares_1 and squares_2
#> t = 4.8826, df = 99, p-value = 2.01e-06
#> alternative hypothesis: true difference in means is greater than 0
#> 95 percent confidence interval:
#>  0.1769769       Inf
#> sample estimates:
#> mean of the differences 
#>               0.2681719

We get a super-low p-value. I used a paired t-test because, observation for observation, the two sets of predictions are compared against the same outcomes.
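Note that a paired t-test is the same thing as a one-sample t-test on the per-observation differences, so an equivalent way to write the test above (same t, df and p-value) is:

# one-sample test that the mean difference in squared errors is > 0
d <- squares_1 - squares_2
t.test(d, alternative="greater")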

Another option: permutation testing

If the distribution of the squared errors worries you, perhaps you don't want to make the assumptions of a t-test. You could instead test the same hypothesis with a permutation test:

library(plyr)

# observed difference in mean squared error (i.e. the difference in Brier scores)
observed <- mean(squares_1) - mean(squares_2)

permutations <- raply(500000, {
  # for each observation, randomly swap which model each squared error is assigned to
  swap <- sample(c(TRUE, FALSE), 100, replace=TRUE)
  one <- squares_1
  one[swap] <- squares_2[swap]

  two <- squares_2
  two[swap] <- squares_1[swap]

  mean(one) - mean(two)
})

hist(permutations, prob=TRUE, nclass=60, xlim=c(-.4, .4))
abline(v=observed, col="red")

# permutation p-value; add 1 so that it can't come out exactly 0
(sum(permutations > observed) + 1)/(length(permutations) + 1) 
#> [1] 1.999996e-06

The two tests seem to agree closely.

Some other answers

A quick search of this site on comparing MSEs points to the Diebold-Mariano test (see the answer here, and a comment here). It looks like it's essentially a Wald test, and I'd guess it performs similarly to the t-test above.
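If you want to try it, dm.test in the forecast package implements the Diebold-Mariano test. A sketch like the one below should work; note that dm.test is built for time-series forecast errors, so treating your 100 observations as a series with horizon h = 1 is an assumption here, and power = 2 gives the same squared-error loss used above:

library(forecast)

# alternative="greater": the second set of predictions (p2) is more accurate than the first (p1)
dm.test(p1 - y, p2 - y, alternative="greater", h=1, power=2)

With h = 1 the variance of the loss differential is estimated from its lag-0 autocovariance only, so the result should land close to the paired t-test.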
