This seems like a common problem but I cannot find a solution.
I have a set of binary observations and two different models, each with predictions for each observation. I want to compare the calibration of the models.
There are several approaches to comparing the discrimination of these models (e.g. roc.test in the pROC package in R), but I have found no approach for comparing calibration. Most empirical papers just list the p-values from two separate calibration tests, each checking whether one model's calibration is off (e.g. Hosmer-Lemeshow, Brier score).
What I am looking for is a direct statistical comparison of the calibration between two models.
Here's an extreme test data set. The Brier score, the Spiegelhalter Z-test, etc. all indicate that p2 is better calibrated, and we know it is. Can anyone turn this into a formal statistical test?
library("rms")  # provides val.prob()
y <- rbinom(100,1,1:100/100)
p1 <- 1:100/10001
p2 <- 1:100/101
val.prob(p1,y)
val.prob(p2,y)
Best Answer
As you know, the Brier score measures calibration and is the mean squared error, $\bar B = n^{-1} \sum (\hat y_i - y_i)^2$, between the predictions, $\hat y$, and the responses, $y$. Since the Brier score is a mean, comparing two Brier scores is basically a comparison of means, and you can go as fancy with it as you like. I'll suggest two things and point to a third:
One option: do a t-test
My immediate response when I hear comparisons of means is to do a t-test. Squared errors probably aren't normally distributed in general, so this may not be the most powerful test, but it seems fine in your extreme example. I test the alternative hypothesis that p1 has greater MSE than p2, and we get a super-low p-value. I did a paired t-test because, observation for observation, the two sets of predictions are compared against the same outcome.
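A minimal sketch of the paired t-test described above, assuming the y, p1 and p2 from the question. The seed and the names se1, se2 and tt are illustrative additions, not part of the original.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Data from the question
y  <- rbinom(100, 1, 1:100 / 100)
p1 <- 1:100 / 10001
p2 <- 1:100 / 101

# Per-observation squared errors; their means are the two Brier scores
se1 <- (p1 - y)^2
se2 <- (p2 - y)^2

# Paired, one-sided test: H1 is that p1 has the greater mean squared error
tt <- t.test(se1, se2, paired = TRUE, alternative = "greater")
tt
```

Pairing matters here because both squared errors in a pair share the same outcome y, so the test works on the per-observation differences se1 - se2 rather than on two independent samples.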
Another option: permutation testing
If the distribution of the squared errors worries you, perhaps you don't want to make the assumptions of a t-test. You could, for instance, test the same hypothesis with a permutation test.
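One way such a test could be set up is by randomly flipping the sign of each paired difference, which amounts to swapping the two models' predictions within an observation. This is a sketch of that variant under the question's data, not necessarily the exact procedure used here; the seed, n_perm and the variable names are illustrative.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Data from the question
y  <- rbinom(100, 1, 1:100 / 100)
p1 <- 1:100 / 10001
p2 <- 1:100 / 101

# Paired differences in squared error (positive means p1 did worse)
d <- (p1 - y)^2 - (p2 - y)^2
observed <- mean(d)

# Under H0 the two models are exchangeable within each observation,
# so the sign of each difference can be flipped at random
n_perm <- 10000
perm_means <- replicate(
  n_perm,
  mean(d * sample(c(-1, 1), length(d), replace = TRUE))
)

# One-sided p-value, counting the observed statistic as one permutation
p_perm <- (sum(perm_means >= observed) + 1) / (n_perm + 1)
p_perm
```

The smallest p-value this can report is 1 / (n_perm + 1), so a larger n_perm is needed to resolve very small p-values.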
The two tests seem to agree closely.
Some other answers
A quick search of this site for comparisons of MSEs points to the Diebold-Mariano test (see the answer here, and a comment here). This looks like it's simply Wald's test, and I guess it will perform similarly to the t-test above.
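To illustrate why the two should behave alike, here is a hand-rolled sketch of a Diebold-Mariano-style statistic on the question's data. It assumes independent observations, so no autocorrelation correction is applied to the variance; with that simplification the statistic is the mean loss difference divided by its standard error, referred to a standard normal. The seed and variable names are illustrative.

```r
set.seed(1)  # assumption: seed added only so the sketch is reproducible

# Data from the question
y  <- rbinom(100, 1, 1:100 / 100)
p1 <- 1:100 / 10001
p2 <- 1:100 / 101

# Loss differential under squared-error loss
d <- (p1 - y)^2 - (p2 - y)^2
n <- length(d)

# Mean differential over its standard error (no autocorrelation terms)
z <- mean(d) / sqrt(var(d) / n)

# One-sided p-value for H1: p1 has the greater expected loss
p_z <- pnorm(z, lower.tail = FALSE)
c(z = z, p = p_z)
```

In this simplified form z is numerically identical to the paired t statistic; only the reference distribution differs (normal rather than t with n - 1 degrees of freedom).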