Probability – Comprehensive Methods for Evaluating Predictive Models in Two-Outcome Systems

predictive-models, probability

I'm quite an amateur data analyst/modeller trying to teach myself some new techniques. I have a system in which, on the occurrence of an event, two outcomes can occur (either 0 or 1), and a number of prediction models designed to give the probability of the outcome 1, based on previous events. Some of these models are obviously better than others, but I'm looking for a way to quantitatively assess the performance of those where it is less obvious. For example, given two models (p1 and p2) and two events (out1, out2):

p1 = 0.70
p2 = 0.60
out1 = 1

p1 = 0.65
p2 = 0.55
out2 = 0

In the first event, I would expect most to consider p1 the better model, as it predicted a higher probability for outcome 1, which is what occurred. In the second event, I would expect most to consider p2 the better model: even though both models predicted in the wrong direction, p2 assigned a lower probability to the outcome that did not occur.

This suggests to me that simply scoring each model based on something like

if (p > 0.5 && out == 1)  score++;
if (p <= 0.5 && out == 0)  score++;

would be insufficient to accurately evaluate these models. My questions are:

  • What methods could be used to evaluate these models?
  • How do these methods differ in what they tell us about the models? For example, are some methods skewed towards models which are often correct but rarely give probabilities outside the range 0.45 < p < 0.55, while others are skewed towards models which are less often correct but regularly give probabilities in the ranges 0 <= p < 0.2 and 0.8 < p <= 1?
  • (Slightly less relevant but interesting) How do these methods translate into a system which has more than two outcomes?
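
To make the insufficiency of the simple threshold scoring above concrete, here is a minimal sketch in plain Python (using the two events listed above) showing that both models end up with exactly the same score under that rule:

# Naive threshold scoring: a model earns a point when its prediction,
# cut at 0.5, lands on the observed outcome.
events = [
    {"p1": 0.70, "p2": 0.60, "out": 1},
    {"p1": 0.65, "p2": 0.55, "out": 0},
]

scores = {"p1": 0, "p2": 0}
for e in events:
    for model in ("p1", "p2"):
        predicted = 1 if e[model] > 0.5 else 0
        if predicted == e["out"]:
            scores[model] += 1

print(scores)  # {'p1': 1, 'p2': 1} -- the threshold score cannot separate the two models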

Best Answer

The following is taken from a related answer I gave here:

Suppose your model does indeed predict A has a 40% chance and B has a 60% chance. In some circumstances you might wish to convert this into a classification that B will happen (since it is more likely than A). Once converted into a classification, every prediction is either right or wrong, and there are a number of interesting ways to tally those right and wrong answers. One is straight accuracy (the percentage of right answers). Others include precision and recall or F-measure. As others have mentioned, you may wish to look at the ROC curve. Furthermore, your context may supply a specific cost matrix that rewards true positives differently from true negatives and/or penalizes false positives differently from false negatives.
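
As an illustration of those classification-style metrics, here is a minimal sketch assuming scikit-learn is installed; the probability and outcome arrays are made-up examples, not data from the question:

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical forecasted probabilities of outcome 1 and the observed outcomes.
probs = np.array([0.70, 0.65, 0.20, 0.90, 0.40])
outcomes = np.array([1, 0, 0, 1, 1])

# Convert the probabilities into hard classifications at a 0.5 threshold.
classes = (probs > 0.5).astype(int)

print("accuracy :", accuracy_score(outcomes, classes))
print("precision:", precision_score(outcomes, classes))
print("recall   :", recall_score(outcomes, classes))
print("F-measure:", f1_score(outcomes, classes))
# ROC AUC uses the probabilities directly rather than the thresholded classes.
print("ROC AUC  :", roc_auc_score(outcomes, probs))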

However, I don't think that's what you are really looking for. If you said B has a 60% chance of happening and I said it had a 99% chance of happening, we have very different predictions even though they would both get mapped to B in a simple classification system. If A happens instead, you are just kind of wrong while I am very wrong, so I'd hope that I would receive a stiffer penalty than you. When your model actually produces probabilities, a scoring rule is a measure of the performance of your probability predictions. Specifically, you probably want a proper scoring rule, meaning one whose expected score is optimized by reporting well-calibrated probabilities.
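
To make the "stiffer penalty" idea concrete, here is a small sketch using the logarithmic score, one well-known proper scoring rule (written here as a penalty, so lower is better). The 0.60 and 0.99 forecasts are the ones from the paragraph above, and B is assumed not to have happened:

import math

# Penalty under the logarithmic scoring rule: -ln(probability assigned
# to the outcome that actually occurred). B was forecast but A happened,
# so the relevant probability is 1 - P(B).
for p_b in (0.60, 0.99):
    penalty = -math.log(1.0 - p_b)
    print(f"forecast P(B) = {p_b:.2f} -> penalty {penalty:.2f}")

# forecast P(B) = 0.60 -> penalty 0.92   (kind of wrong)
# forecast P(B) = 0.99 -> penalty 4.61   (very wrong, so a much stiffer penalty)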

A common example of a scoring rule is the Brier score: $$BS = \frac{1}{N}\sum\limits _{t=1}^{N}(f_t-o_t)^2$$ where $f_t$ is the forecasted probability of the event happening and $o_t$ is 1 if the event did happen and 0 if it did not.
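
Applied to the two models and two events from the question (a minimal sketch; lower Brier scores are better):

# Brier score: mean squared difference between forecast probability and outcome.
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

outcomes = [1, 0]        # out1 = 1, out2 = 0
p1 = [0.70, 0.65]
p2 = [0.60, 0.55]

print("Brier score, model p1:", brier_score(p1, outcomes))  # 0.25625
print("Brier score, model p2:", brier_score(p2, outcomes))  # 0.23125

On this tiny two-event sample p2 comes out slightly lower (better), which matches the intuition in the question that p2's smaller miss on the second event should count for something.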

Of course the type of scoring rule you choose might depend on what type of event you are trying to predict. However, this should give you some ideas to research further.

I'll add a caveat that regardless of what you do, when assessing your model this way I suggest you look at your metric on out-of-sample data (that is, data not used to build your model). This can be done through cross-validation. Perhaps more simply, you can build your model on one dataset and then assess it on another (being careful not to let inferences from the out-of-sample data spill into the in-sample modeling).
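
Here is a minimal sketch of that out-of-sample check, assuming scikit-learn and purely synthetic data standing in for your events (the logistic regression is just a placeholder model, not anything specific to your problem):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                          # hypothetical features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # hypothetical 0/1 outcomes

# Hold out data the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
test_probs = model.predict_proba(X_test)[:, 1]         # forecast P(outcome = 1)

# Brier score computed only on the held-out events.
print("out-of-sample Brier score:", brier_score_loss(y_test, test_probs))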

As for your 2nd point about comparing the methods: in short, yes, the methods do differ in what they reward. You might be interested in Some Comparisons among Quadratic, Spherical, and Logarithmic Scoring Rules by J. Eric Bickel. You might also be interested in Winkler and Murphy's Nonlinear Utility and the Probability Score. With nonlinear utility functions, risk-taking modelers will prefer to report probabilities closer to certainty (towards 0 or 1), whereas risk-avoiders will prefer to hedge towards probabilities closer to 0.5 (or $\frac{1}{R}$ in the case of $R$ classes).
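
To get a feel for how those rules behave, here is a small sketch using the positively-oriented forms commonly used in that literature (higher is better); treat the exact formulas as my assumption rather than a quotation from either paper:

import math

def quadratic_score(p, j):
    # Quadratic (Brier-type) score for probability vector p and observed class index j.
    return 2 * p[j] - sum(q * q for q in p)

def spherical_score(p, j):
    return p[j] / math.sqrt(sum(q * q for q in p))

def logarithmic_score(p, j):
    return math.log(p[j])

# Two binary forecasts where the observed class is index 0:
# a hedged forecast near 0.5 and a confident forecast near 1.
for p in ([0.55, 0.45], [0.95, 0.05]):
    print(p,
          round(quadratic_score(p, 0), 3),
          round(spherical_score(p, 0), 3),
          round(logarithmic_score(p, 0), 3))

The three rules all prefer the confident correct forecast, but they reward it to different degrees, which is where the hedging versus risk-taking behaviour discussed in those papers comes from.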

As for your 3rd point about translating to more than two outcomes, there are two main approaches I know of. One would be to calculate a separate score for each type of event. So if there are events A, B, and C, we create one score for forecasting A vs. not A, one for B vs. not B, and one for C vs. not C. In effect this turns our problem into many binary problems. The other approach is to effectively sum the scores over all classes. In the case of the Brier score that would look like this (assuming $R$ distinct classes): $$ BS = \frac{1}{N}\sum\limits _{t=1}^{N}\sum\limits _{i=1}^{R}(f_{ti}-o_{ti})^2$$
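
Here is a minimal sketch of that multi-class version, with the outcomes one-hot encoded over hypothetical classes A, B, and C:

def multiclass_brier(forecasts, outcomes):
    # forecasts: one probability vector over the R classes per event.
    # outcomes: one-hot vectors, 1 for the class that occurred, 0 otherwise.
    n = len(forecasts)
    return sum(
        sum((f_i - o_i) ** 2 for f_i, o_i in zip(f, o))
        for f, o in zip(forecasts, outcomes)
    ) / n

# Two events over classes (A, B, C); the second forecast is more confident.
forecasts = [[0.5, 0.3, 0.2],
             [0.1, 0.8, 0.1]]
outcomes  = [[1, 0, 0],   # A happened
             [0, 0, 1]]   # C happened

print(multiclass_brier(forecasts, outcomes))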