I'm quite an amateur data analyst/modeller trying to teach myself some new techniques. I have a system in which each event has one of two outcomes (0 or 1), and a number of prediction models designed to give the probability of outcome 1 based on previous events. Some of these models are obviously better than others, but I'm looking for a way to quantitatively assess the performance of those where it is less obvious. For example, given two models (p1 and p2) and two events (out1, out2):
Event 1: p1 = 0.70, p2 = 0.60, out1 = 1
Event 2: p1 = 0.65, p2 = 0.55, out2 = 0
In the first event, I would expect most to consider p1 the better prediction, as it assigned a higher probability to the outcome that actually occurred (1). In the second event, I would expect most to consider p2 better: even though both models leaned towards the wrong outcome, p2 assigned less probability to the wrong outcome (and hence more to the correct one).
This suggests to me that simply scoring each model based on something like
if (p > 0.5 && out == 1) score++;
if (p <= 0.5 && out == 0) score++;
would be insufficient to accurately evaluate these models (see the runnable sketch after the questions below). My questions are:
- What methods could be used to evaluate these models?
- How do these methods differ in what they tell us about the models? E.g. are some methods skewed towards models which are often correct but rarely give probabilities outside the range 0.45 < p < 0.55, or towards models which are less often correct but regularly give probabilities in the ranges 0 <= p < 0.2 and 0.8 < p <= 1?
- (Slightly less relevant but interesting) How do these methods translate into a system which has more than two outcomes?
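To make the insufficiency concrete, here is a minimal runnable version of the threshold scoring above (a Python sketch; the names events, score1, and score2 are my own). On the two example events both models land on the same side of 0.5 both times, so they receive identical scores and this rule cannot separate them:

# Threshold ("accuracy") scoring: +1 whenever the model lands on the
# correct side of 0.5, regardless of how confident it was.
events = [(0.70, 0.60, 1),  # event 1: (p1, p2, outcome)
          (0.65, 0.55, 0)]  # event 2

score1 = score2 = 0
for p1, p2, out in events:
    score1 += (p1 > 0.5) == (out == 1)
    score2 += (p2 > 0.5) == (out == 1)

print(score1, score2)  # -> 1 1: identical, so this rule cannot rank the models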
Best Answer
The following is taken from a related answer I gave here:
As for your 2nd point about comparing the methods: in short, yes, different scoring rules reward different behaviour. You might be interested in Some Comparisons among Quadratic, Spherical, and Logarithmic Scoring Rules by J. Eric Bickel. You might also be interested in Winkler and Murphy's Nonlinear Utility and the Probability Score. With nonlinear utility functions, risk-taking modelers will prefer to report probabilities closer to certainty (towards 0 or 1), whereas risk-avoiders will prefer to hedge towards probabilities closer to 0.5 (or $\frac{1}{R}$ in the case of $R$ classes).
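To see how the rules differ in practice, here is a small Python sketch (my own illustration, not code from the papers above) that scores a single binary forecast p = P(outcome 1) under the quadratic (Brier), logarithmic, and spherical rules, using their standard definitions:

import math

def scores(p, out):
    # q is the probability the model assigned to the outcome that occurred
    q = p if out == 1 else 1 - p
    quadratic = (p - out) ** 2                     # Brier: lower is better
    logarithmic = -math.log(q)                     # log score: lower is better
    spherical = q / math.sqrt(p**2 + (1 - p)**2)   # spherical: higher is better
    return quadratic, logarithmic, spherical

# The two events from the question:
for p1, p2, out in [(0.70, 0.60, 1), (0.65, 0.55, 0)]:
    print(scores(p1, out), scores(p2, out))

All three are proper scoring rules, but they penalize overconfidence differently: the log score punishes a confident miss without bound (it diverges as q approaches 0), whereas the Brier penalty for a binary forecast is bounded at 1.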
As for your 3rd point about translating to more than two outcomes, there are two main approaches I know of. One is to calculate a separate score for each type of event: if there are events A, B, and C, we create one score for forecasting A vs. not A, one for B vs. not B, and one for C vs. not C. In effect this turns the problem into many binary problems. The other approach is to sum the score over all classes. For the Brier score, with $N$ events and $R$ distinct classes, where $f_{ti}$ is the forecast probability of class $i$ at event $t$ and $o_{ti}$ is 1 if class $i$ occurred at event $t$ and 0 otherwise, that looks like: $$ BS = \frac{1}{N}\sum\limits _{t=1}^{N}\sum\limits _{i=1}^{R}(f_{ti}-o_{ti})^2$$
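A minimal sketch of that multiclass sum, assuming forecasts is an N-by-R array of probability vectors and outcomes the matching one-hot (0/1) indicators (the function name is my own):

def brier_multiclass(forecasts, outcomes):
    # BS = (1/N) * sum over events t and classes i of (f_ti - o_ti)^2
    # Lower is better; this form ranges from 0 to 2.
    N = len(forecasts)
    return sum((f - o) ** 2
               for ft, ot in zip(forecasts, outcomes)
               for f, o in zip(ft, ot)) / N

# Three classes (A, B, C), two events in which A and then C occurred:
f = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]]
o = [[1, 0, 0], [0, 0, 1]]
print(brier_multiclass(f, o))  # ~0.34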