Time Series – Forecast Accuracy Metric That Involves Prediction Intervals

accuracy · forecasting · prediction interval · time series

I'm in the process of generating a time series forecast for a company's product revenue and am looking for some way to show accuracy over time – e.g. after, say, 6 months they want to see how the actual revenue compares to the forecast generated 6 months earlier.

I'm generating the forecast with the ets() function from R's forecast package, making predictions for each month over the next 6 months, including prediction intervals.

Are there any forecast accuracy metrics that take these prediction intervals into account?

I know of the standard MAPE, MASE, etc., but these all apply to point forecasts. What I'm looking for is a measure that also takes into account how accurate the prediction intervals are – e.g. if we're generating 95% prediction intervals but the actual value only falls inside them 10% of the time, I want to be able to identify this.

Best Answer

This is a good question. Unfortunately, while the academic forecasting literature is indeed (slowly) moving from an almost exclusive emphasis on point forecasts towards interval forecasts and predictive densities, there has been little work on evaluating interval forecasts. (EDIT: See the bottom of this answer for an update.)

As gung notes, whether or not a given 95% prediction interval contains the actual realization is in principle a Bernoulli trial with $p=0.95$. Given enough PIs and realizations, you can therefore test the null hypothesis that the actual coverage probability is in fact at least 95%.

However, you will need to think about statistical power and sample sizes. It's probably best to decide beforehand what kind of deviation from the target coverage is still acceptable (is 92% OK? 90%?), then find the minimum sample size needed to detect a deviation that strong or stronger with a given power, say 80%, which is a standard choice. You can do this by a straightforward simulation: simulate $n$ Bernoulli trials with $p=0.92$, estimate $\hat{p}$ together with a confidence interval for it, check whether the CI contains 95%, repeat this many times, and tweak $n$ until 95% lies outside the CI in 80% of the simulated runs. Or use any Bernoulli power calculator.
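A rough sketch of such a simulation in R. The 92% alternative coverage, the 80% power target, and the range of sample sizes tried are illustrative choices you would set yourself, and binom.test() stands in for the explicit confidence-interval check, which amounts to the same idea:

```r
# Find the number of PI/realization pairs needed so that a nominal 95% PI
# whose true coverage is only 92% is detected with roughly 80% power.
power_for_n <- function(n, p_true = 0.92, p_nominal = 0.95, n_sim = 5000) {
  rejections <- replicate(n_sim, {
    hits <- rbinom(1, size = n, prob = p_true)              # simulated coverage count
    test <- binom.test(hits, n, p = p_nominal, alternative = "less")
    test$p.value < 0.05                                     # reject "coverage >= 95%"?
  })
  mean(rejections)                                          # estimated power
}

# Increase n until the simulated power reaches 80%
for (n in seq(100, 2000, by = 100)) {
  pow <- power_for_n(n)
  cat("n =", n, " estimated power =", round(pow, 3), "\n")
  if (pow >= 0.80) break
}
```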

OK, now that we have our sample size, you can group your PIs and realizations into batches of that size, see how often the PIs contain the corresponding realization, and start testing. A batch can be the last $n$ PI/realization pairs of a single time series, or the most recent PI/realization pairs across a large number of time series you are forecasting, or whatever.
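The test on one such batch is then essentially a one-liner; a minimal sketch, with simulated hit/miss indicators standing in for your own batch:

```r
# covered: one logical entry per PI/realization pair in the batch,
# TRUE if the realization fell inside its nominal 95% PI (illustrative data:
# we pretend the true coverage is only 92%)
set.seed(1)
covered <- runif(500) < 0.92

# Test H0: coverage >= 95% against the one-sided alternative that it is lower
binom.test(sum(covered), length(covered), p = 0.95, alternative = "less")
```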

This approach has the advantage of being rather easy to explain and to understand. Of course, with a large number of trials, even small deviations from the target coverage will be statistically significant, which is why you'll need to think about what deviation is actually meaningful from a business perspective, as per above.

Alternatively, quantile forecasts (say, a 2.5% and a 97.5% quantile forecast, to yield a 95% PI) arise naturally as optimal point forecasts under certain loss functions, parameterized by the target quantile (the so-called pinball or quantile loss). This paper gives a nice overview. This may be an alternative to the Bernoulli tests above: take the loss functions for your target lower and upper quantiles, then evaluate the two endpoints of your PIs under them. However, these loss functions are rather abstract and not easily understood, especially by nontechnical audiences.
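For concreteness, here is a small sketch of that pinball loss evaluated at the two endpoints of a 95% PI; the actuals and interval endpoints are illustrative stand-ins for your own data:

```r
# Pinball / quantile loss: the tau-quantile is the optimal point forecast
# under this loss, so it can score the endpoints of a PI separately
pinball_loss <- function(actual, q, tau) {
  ifelse(actual >= q, tau * (actual - q), (1 - tau) * (q - actual))
}

# Illustrative data: actuals plus lower/upper endpoints of 95% PIs
set.seed(1)
actual <- rnorm(150, mean = 100, sd = 10)
lower  <- 100 + qnorm(0.025) * 10   # hypothetical 2.5% quantile forecasts
upper  <- 100 + qnorm(0.975) * 10   # hypothetical 97.5% quantile forecasts

# Average loss for each endpoint; lower is better
mean(pinball_loss(actual, lower, tau = 0.025))
mean(pinball_loss(actual, upper, tau = 0.975))
```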

If you are comparing, say, multiple forecasting methods, you could first discard those whose PIs significantly underperform, based on Bernoulli hypothesis tests or loss functions, and then assess the ones that pass this initial screening on the width of their PIs: of two PIs with the same correct coverage, the narrower one is usually better.

For a simple evaluation of PIs using null hypothesis significance tests, see this paper. There are also some far more elaborate schemes for evaluating PIs, which can also deal with serial dependence in coverage deviations (maybe your financial PIs are good for part of the year, but bad at specific times), like this paper and that paper. Unfortunately, these require quite a large number of PIs and realizations and so are likely only relevant for high-frequency financial data, like stock prices reported multiple times per day.

Finally, there has recently been some interest in going beyond PIs to the underlying predictive densities, which can be evaluated using (proper) scoring rules. Tilmann Gneiting has been very active in this area, and this paper of his gives a good introduction. However, even if you do decide to go deeper into predictive densities, scoring rules are again quite abstract and hard to communicate to a nontechnical audience.


EDIT - an update:

Your quality measure needs to balance coverage and length of the prediction intervals: yes, we want high coverage, but we also want short intervals.

There is a quality measure that does precisely this and has attractive properties: the interval score. Let $\ell$ and $u$ be the lower and the upper end of the prediction interval, and let $h$ be the actual realization. The score is given by

$$ S(\ell,u,h) = (u-\ell)+\frac{2}{\alpha}(\ell-h)1(h<\ell)+\frac{2}{\alpha}(h-u)1(h>u). $$

Here $1(\cdot)$ is the indicator function, and $\alpha$ corresponds to the nominal coverage $1-\alpha$ your intervals are aiming for, e.g. $\alpha=0.05$ for 95% PIs. (You will need to prespecify this coverage, based on what you plan on doing with the prediction interval. It makes no sense to aim for 100% coverage, because the resulting intervals will be too wide to be useful for anything.)

You can then average the interval score over many predictions. The lower the average score, the better. See Gneiting & Raftery (2007, JASA) for a discussion and pointers to further literature. A scaled version of this score was used, for instance, to assess prediction intervals in the recent M4 forecasting competition.
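A minimal sketch of this calculation in R, using the notation above, with $\alpha=0.05$ for 95% PIs and illustrative data in place of your own forecasts:

```r
# Interval score for a (1 - alpha) prediction interval [l, u] and realization h
interval_score <- function(l, u, h, alpha = 0.05) {
  (u - l) +
    (2 / alpha) * (l - h) * (h < l) +   # penalty if the realization falls below the PI
    (2 / alpha) * (h - u) * (h > u)     # penalty if it falls above the PI
}

# Illustrative data: 95% PIs and realizations
set.seed(1)
actual <- rnorm(150, mean = 100, sd = 10)
lower  <- 100 + qnorm(0.025) * 10
upper  <- 100 + qnorm(0.975) * 10

# Average interval score across all predictions; lower is better
mean(interval_score(lower, upper, actual, alpha = 0.05))
```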

(Full disclosure: this was shamelessly cribbed from this answer of mine.)