Solved – Understanding the Ranked Probability Score

forecasting, model selection, predictive-models

The ranked probability score (RPS) measures how well forecasts that are expressed as probability distributions match observed outcomes. Both the location and the spread of the forecast distribution are taken into account when judging how close the distribution is to the observed value.

\begin{equation}
\mathrm{RPS}=\dfrac{1}{r-1}\sum\limits_{i=1}^{r}\left(\sum\limits_{j=1}^{i}p_j-\sum\limits_{j=1}^{i}e_j\right)^2,
\end{equation}

where $r$ is the number of (ordered) outcomes, $p_j$ is the forecast probability of outcome $j$, and $e_j$ is the actual probability of outcome $j$ (in practice a 0/1 indicator of whether outcome $j$ occurred). The special case $r=2$ gives the Brier score.
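For concreteness, here is a minimal sketch of this formula in Python (the function name and the 0-based outcome index are my own choices; I take the "actual probabilities" $e_j$ to be the usual 0/1 indicator of the observed outcome):

```python
import numpy as np

def ranked_probability_score(p, outcome):
    """RPS for a single categorical forecast over r ordered outcomes.

    p       : forecast probabilities for the r outcomes (should sum to 1)
    outcome : 0-based index of the outcome that actually occurred
    """
    p = np.asarray(p, dtype=float)
    r = len(p)
    e = np.zeros(r)
    e[outcome] = 1.0  # e_j as a 0/1 indicator of the observed outcome
    cum_diff = np.cumsum(p) - np.cumsum(e)
    return np.sum(cum_diff**2) / (r - 1)
```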

The score lies between 0 and 1, with lower scores being better. But does a value of, say, 0.8 imply your forecasts are not very good? Or is the RPS similar to $R^2$, where a low value does not necessarily imply you have a bad model and the measure is better suited to comparing models?

At first I thought it unlikely that the RPS would say anything about your forecasts in absolute terms (as opposed to relative to another model). But consider two outcomes ($r=2$) and say your model predicted each outcome to occur with probability 0.5. Then the RPS is 0.25 regardless of the outcome. So, for $r=2$, a value greater than 0.25 says your model favoured the wrong outcome and a lower value implies it favoured the correct one.

Unfortunately, this idea does not work for $r=3$. Taking the probability of each of the three outcomes to be 1/3 gives RPS values of 5/18, 1/9 and 5/18 when outcome 1, 2 or 3 actually occurs, respectively. This does make sense, though: predicting outcome 1 when outcome 2 happened is better than predicting outcome 1 when outcome 3 occurred, and this sensitivity to distance is exactly the motivation for the score. Nonetheless, I am back to thinking it is only suitable for comparing models.
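A quick numerical check of these values with the sketch above:

```python
# r = 2, uniform forecast: 0.25 whichever outcome occurs
print(ranked_probability_score([0.5, 0.5], 0))  # 0.25
print(ranked_probability_score([0.5, 0.5], 1))  # 0.25

# r = 3, uniform forecast: the middle outcome scores best
for k in range(3):
    print(ranked_probability_score([1/3, 1/3, 1/3], k))
# 0.2778 (5/18), 0.1111 (1/9), 0.2778 (5/18)
```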

A second question: $R^2$ is the proportion of response variation "explained" by the regressors in the model. Is there a similar interpretation for the RPS?

Finally, if a model makes a number of forecasts – say it predicts the next outcome, the outcome is observed, the model is updated, and then the model predicts the next outcome – would simply averaging the individual ranked probability scores for each predicted outcome be appropriate for comparing two models? I think this is reasonable, but some formal reasoning would obviously be preferred.
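If it is appropriate, comparing two models then reduces to comparing mean scores; a toy sketch with made-up forecasts (not data from the question), reusing the function above:

```python
# Hypothetical sequences: each row is one forecast over r = 3 ordered
# outcomes, issued before the corresponding outcome was observed.
forecasts_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
forecasts_b = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
outcomes = [0, 1, 2]  # observed outcome indices

mean_rps_a = np.mean([ranked_probability_score(p, o)
                      for p, o in zip(forecasts_a, outcomes)])
mean_rps_b = np.mean([ranked_probability_score(p, o)
                      for p, o in zip(forecasts_b, outcomes)])
print(mean_rps_a, mean_rps_b)  # lower mean RPS is better
```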

Best Answer

Equation 7 in Hersbach inspired me to notice that the RPS (a discrete version of the continuous ranked probability score, or CRPS) is a sum of Brier quadratic probability scores (BS) evaluated at a sequence of cumulative probability thresholds. (Hersbach goes on to develop an interpretable decomposition of the CRPS for ensemble forecasts.) That is,

\begin{equation} \mathrm{RPS}=\dfrac{1}{r-1}\sum\limits_{i=1}^{r} BS(i), \end{equation}

where $BS(i)$ is the Brier score for a single forecast of the probability that the outcome of interest is one of the first $i$ (out of $r$ possible) outcomes. (The $1/(r-1)$ factor matches the normalized definition above, and the $i=r$ term is identically zero, since both cumulative probabilities equal 1 there.)
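Numerically, the identity is easy to check against the direct formula, reusing the ranked_probability_score sketch from the question:

```python
def rps_as_brier_sum(p, outcome):
    """RPS as a (normalized) sum of Brier scores over cumulative thresholds."""
    p = np.asarray(p, dtype=float)
    r = len(p)
    e = np.zeros(r)
    e[outcome] = 1.0
    # BS(i): Brier score for the binary event "outcome is among the first i"
    bs = [(p[:i].sum() - e[:i].sum()) ** 2 for i in range(1, r + 1)]
    return sum(bs) / (r - 1)

p, o = [0.2, 0.5, 0.3], 1
print(rps_as_brier_sum(p, o), ranked_probability_score(p, o))  # identical
```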

Seeing the RPS as a sum of Brier scores is potentially interesting because the Brier score can be decomposed into three interpretable components (see Wikipedia, or page 754 here): $$BS = \text{reliability} - \text{resolution} + \text{uncertainty}$$

Reliability measures miscalibration (conditional bias): it is zero when forecast probabilities match the observed relative frequencies, so smaller is better.

Resolution is somewhat analogous to the R-squared in regression (but don't look for an exact analogy since there isn't a clear definition for R-squared for predictions with binary outcomes).

Uncertainty is somewhat analogous to residual standard error in regression.
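As an illustration, here is a minimal sketch of the Murphy decomposition for a set of binary forecasts, binning forecasts by probability (the bin count and variable names are my own choices; the identity $BS = \text{reliability} - \text{resolution} + \text{uncertainty}$ is exact only when all forecasts within a bin are identical, and approximate otherwise):

```python
def brier_decomposition(probs, obs, n_bins=10):
    """Murphy decomposition of the Brier score for binary outcomes.

    Returns (reliability, resolution, uncertainty) with
    BS ~= reliability - resolution + uncertainty.
    """
    probs = np.asarray(probs, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = len(obs)
    base_rate = obs.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    # Assign each forecast to a probability bin
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_k = probs[mask].mean()  # mean forecast probability in bin k
        o_k = obs[mask].mean()    # observed relative frequency in bin k
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    return reliability / n, resolution / n, uncertainty
```

With this convention, small reliability indicates good calibration, while large resolution indicates forecasts that discriminate between occasions with different outcome frequencies.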

If you see the RPS as a sum (or mean) of Brier scores, and if you like the Brier score decomposition mentioned above, then surely there is a way to write the RPS as something like

$$RPS = \overline {Rel} - \overline {Res} + \overline {Unc} $$

where the terms on the right-hand side are the means of the Brier score decomposition components over the underlying Brier scores (with the $1/(r-1)$ normalization, the mean over the $r-1$ informative thresholds).

Thus, in a heuristic sense, I'm guessing that the closest thing you'll get to an "R-squared" with the RPS is an "average BS R-squared" if you take the time to do the decomposition.
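A sketch of what "taking the time to do the decomposition" might look like, applying the binary decomposition above to each cumulative threshold of a collection of categorical forecasts (the helper name is hypothetical, and the last threshold is skipped because its Brier score is identically zero):

```python
def rps_decomposition(forecasts, outcomes, n_bins=10):
    """Heuristic RPS decomposition: mean Brier components over thresholds."""
    forecasts = np.asarray(forecasts, dtype=float)  # shape (n, r)
    outcomes = np.asarray(outcomes)
    n, r = forecasts.shape
    cum_p = np.cumsum(forecasts, axis=1)
    rel = res = unc = 0.0
    for i in range(r - 1):  # threshold i: event "outcome <= i"
        event = (outcomes <= i).astype(float)
        c_rel, c_res, c_unc = brier_decomposition(cum_p[:, i], event, n_bins)
        rel += c_rel
        res += c_res
        unc += c_unc
    # Means over the r - 1 informative thresholds, matching the
    # 1/(r-1) normalization of the RPS
    return rel / (r - 1), res / (r - 1), unc / (r - 1)
```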

For a relatively rigorous discussion of this decomposition for the RPS, see equation (8b) in Candille and Talagrand.