Solved – Understanding the Ranked Probability Score

forecasting, model selection, predictive-models

The ranked probability score (RPS) measures how well forecasts that are expressed as probability distributions match observed outcomes. Both the location and the spread of the forecast distribution are taken into account when judging how close the distribution is to the observed value.

\begin{equation}
\mathrm{RPS}=\dfrac{1}{r-1}\sum\limits_{i=1}^{r}\left(\sum\limits_{j=1}^{i}p_j-\sum\limits_{j=1}^{i}e_j\right)^2,
\end{equation}

where $r$ is the number of (ordered) outcomes, $p_j$ is the forecast probability of outcome $j$, and $e_j$ is the actual probability of outcome $j$ (in practice a 0/1 indicator of whether outcome $j$ occurred). The special case $r=2$ gives the Brier score.
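For concreteness, here is a minimal sketch of this formula in Python (the function name and the 0-based outcome index are my own choices; I take the "actual probabilities" $e_j$ to be the usual 0/1 indicator of the observed outcome):

```python
import numpy as np

def ranked_probability_score(p, outcome):
    """RPS for a single categorical forecast over r ordered outcomes.

    p       : forecast probabilities for the r outcomes (should sum to 1)
    outcome : 0-based index of the outcome that actually occurred
    """
    p = np.asarray(p, dtype=float)
    r = len(p)
    e = np.zeros(r)
    e[outcome] = 1.0  # e_j as a 0/1 indicator of the observed outcome
    cum_diff = np.cumsum(p) - np.cumsum(e)
    return np.sum(cum_diff**2) / (r - 1)
```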

The score lies between 0 and 1, with lower scores being better. But does a value of, say, 0.8 imply your forecasts are not very good? Or is the RPS similar to $R^2$, where a low value does not necessarily imply you have a bad model and the measure is better suited to comparing models?

At first I thought it unlikely that the RPS would say anything about your forecasts in absolute terms (as opposed to relative to another model). But consider two outcomes ($r=2$) and say your model predicted each outcome to occur with probability 0.5. Then the RPS is 0.25 regardless of the outcome. So, for $r=2$, a value greater than 0.25 says your model favoured the wrong outcome and a lower value implies it favoured the correct one.

Unfortunately, this idea does not work for $r=3$. Taking the probability of each of the three outcomes to be 1/3 gives RPS values of 5/18, 1/9 and 5/18 when outcome 1, 2 or 3 actually occurs, respectively. This does make sense, though: predicting outcome 1 when outcome 2 happened is better than predicting outcome 1 when outcome 3 occurred, and this sensitivity to distance is exactly the motivation for the score. Nonetheless, I am back to thinking it is only suitable for comparing models.
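A quick numerical check of these values with the sketch above:

```python
# r = 2, uniform forecast: 0.25 whichever outcome occurs
print(ranked_probability_score([0.5, 0.5], 0))  # 0.25
print(ranked_probability_score([0.5, 0.5], 1))  # 0.25

# r = 3, uniform forecast: the middle outcome scores best
for k in range(3):
    print(ranked_probability_score([1/3, 1/3, 1/3], k))
# 0.2778 (5/18), 0.1111 (1/9), 0.2778 (5/18)
```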

A second question: $R^2$ is the proportion of response variation "explained" by the regressors in the model. Is there a similar interpretation for the RPS?

Finally, if a model makes a number of forecasts – say it predicts the next outcome, the outcome is observed, the model is updated, and then the model predicts the next outcome – would simply averaging the individual ranked probability scores for each predicted outcome be appropriate for comparing two models? I think this is reasonable, but some formal reasoning would obviously be preferred.
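If it is appropriate, comparing two models then reduces to comparing mean scores; a toy sketch with made-up forecasts (not data from the question), reusing the function above:

```python
# Hypothetical sequences: each row is one forecast over r = 3 ordered
# outcomes, issued before the corresponding outcome was observed.
forecasts_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
forecasts_b = [[0.4, 0.4, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
outcomes = [0, 1, 2]  # observed outcome indices

mean_rps_a = np.mean([ranked_probability_score(p, o)
                      for p, o in zip(forecasts_a, outcomes)])
mean_rps_b = np.mean([ranked_probability_score(p, o)
                      for p, o in zip(forecasts_b, outcomes)])
print(mean_rps_a, mean_rps_b)  # lower mean RPS is better
```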

Best Answer

Equation 7 in Hersbach inspired me to notice that the RPS (a discrete version of the continuous ranked probability score, or CRPS) is a sum of Brier quadratic probability scores (BS) evaluated at a sequence of cumulative probability thresholds. (Hersbach goes on to develop an interpretable decomposition of the CRPS for ensemble forecasts.) That is,

\begin{equation} \mathrm{RPS}=\dfrac{1}{r-1}\sum\limits_{i=1}^{r} BS(i), \end{equation}

where $BS(i)$ is the Brier score for a single forecast of the probability that the outcome of interest is one of the first $i$ (out of $r$ possible) outcomes. (The $1/(r-1)$ factor matches the normalized definition above, and the $i=r$ term is identically zero, since both cumulative probabilities equal 1 there.)
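Numerically, the identity is easy to check against the direct formula, reusing the ranked_probability_score sketch from the question:

```python
def rps_as_brier_sum(p, outcome):
    """RPS as a (normalized) sum of Brier scores over cumulative thresholds."""
    p = np.asarray(p, dtype=float)
    r = len(p)
    e = np.zeros(r)
    e[outcome] = 1.0
    # BS(i): Brier score for the binary event "outcome is among the first i"
    bs = [(p[:i].sum() - e[:i].sum()) ** 2 for i in range(1, r + 1)]
    return sum(bs) / (r - 1)

p, o = [0.2, 0.5, 0.3], 1
print(rps_as_brier_sum(p, o), ranked_probability_score(p, o))  # identical
```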

Seeing the RPS as a sum of Brier scores is potentially interesting because the Brier score can be decomposed into three interpretable components (see Wikipedia, or page 754 here): $$BS = \text{reliability} - \text{resolution} + \text{uncertainty}$$

Reliability measures miscalibration (conditional bias): it is zero when forecast probabilities match the observed relative frequencies, so smaller is better.

Resolution is somewhat analogous to the R-squared in regression (but don't look for an exact analogy since there isn't a clear definition for R-squared for predictions with binary outcomes).

Uncertainty is somewhat analogous to residual standard error in regression.
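As an illustration, here is a minimal sketch of the Murphy decomposition for a set of binary forecasts, binning forecasts by probability (the bin count and variable names are my own choices; the identity $BS = \text{reliability} - \text{resolution} + \text{uncertainty}$ is exact only when all forecasts within a bin are identical, and approximate otherwise):

```python
def brier_decomposition(probs, obs, n_bins=10):
    """Murphy decomposition of the Brier score for binary outcomes.

    Returns (reliability, resolution, uncertainty) with
    BS ~= reliability - resolution + uncertainty.
    """
    probs = np.asarray(probs, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = len(obs)
    base_rate = obs.mean()
    uncertainty = base_rate * (1.0 - base_rate)

    # Assign each forecast to a probability bin
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        p_k = probs[mask].mean()  # mean forecast probability in bin k
        o_k = obs[mask].mean()    # observed relative frequency in bin k
        reliability += n_k * (p_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    return reliability / n, resolution / n, uncertainty
```

With this convention, small reliability indicates good calibration, while large resolution indicates forecasts that discriminate between occasions with different outcome frequencies.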

If you see the RPS as a sum (or mean) of Brier scores, and if you like the Brier score decomposition mentioned above, then surely there is a way to write the RPS as something like

$$RPS = \overline {Rel} - \overline {Res} + \overline {Unc} $$

where the terms on the right-hand side are the means of the Brier score decomposition components over the underlying Brier scores (with the $1/(r-1)$ normalization, the mean over the $r-1$ informative thresholds).

Thus, in a heuristic sense, I'm guessing that the closest thing you'll get to an "R-squared" with the RPS is an "average BS R-squared" if you take the time to do the decomposition.
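A sketch of what "taking the time to do the decomposition" might look like, applying the binary decomposition above to each cumulative threshold of a collection of categorical forecasts (the helper name is hypothetical, and the last threshold is skipped because its Brier score is identically zero):

```python
def rps_decomposition(forecasts, outcomes, n_bins=10):
    """Heuristic RPS decomposition: mean Brier components over thresholds."""
    forecasts = np.asarray(forecasts, dtype=float)  # shape (n, r)
    outcomes = np.asarray(outcomes)
    n, r = forecasts.shape
    cum_p = np.cumsum(forecasts, axis=1)
    rel = res = unc = 0.0
    for i in range(r - 1):  # threshold i: event "outcome <= i"
        event = (outcomes <= i).astype(float)
        c_rel, c_res, c_unc = brier_decomposition(cum_p[:, i], event, n_bins)
        rel += c_rel
        res += c_res
        unc += c_unc
    # Means over the r - 1 informative thresholds, matching the
    # 1/(r-1) normalization of the RPS
    return rel / (r - 1), res / (r - 1), unc / (r - 1)
```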

For a relatively rigorous discussion of this decomposition for the RPS, see equation (8b) in Candille and Talagrand.