Time Series – Interpretation of Mean Absolute Scaled Error (MASE)

Tags: accuracy, forecasting, mase, time-series

Mean absolute scaled error (MASE) is a measure of forecast accuracy proposed by Hyndman & Koehler (2006).

$$MASE=\frac{MAE}{MAE_{in-sample, \, naive}}$$

where $MAE$ is the mean absolute error produced by the actual forecast;

while $MAE_{in-sample, \, naive}$ is the mean absolute error produced by a naive forecast (e.g. the no-change forecast, appropriate for an integrated $I(1)$ time series), calculated on the in-sample data.

(Check out the Hyndman & Koehler (2006) paper for a precise definition and formula.)
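For a non-seasonal series with $T$ in-sample observations and $H$ out-of-sample forecasts $\hat{y}_{T+h}$, the ratio above can be written out explicitly as follows (my paraphrase of the paper's notation, with the no-change forecast as the naive benchmark):

$$MASE=\frac{\frac{1}{H}\sum_{h=1}^{H}\left|y_{T+h}-\hat{y}_{T+h}\right|}{\frac{1}{T-1}\sum_{t=2}^{T}\left|y_{t}-y_{t-1}\right|}$$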

$MASE>1$ implies that the actual forecast does worse out of sample than the naive forecast did in sample, in terms of mean absolute error. Thus, if mean absolute error is the relevant measure of forecast accuracy (which depends on the problem at hand) and we expect the out-of-sample data to be much like the in-sample data, $MASE>1$ suggests that the actual forecast should be discarded in favour of a naive forecast (the caveat is needed because we only know how well the naive forecast performed in sample, not out of sample).
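To make this concrete, here is a minimal sketch in Python (with made-up data, not the competition series) that computes MASE exactly as the ratio above; a value above 1 means the forecast's out-of-sample MAE exceeds the naive method's in-sample MAE:

```python
import numpy as np

def mase(insample, actual_out, forecast_out):
    """Out-of-sample MAE scaled by the in-sample MAE of the naive (no-change) forecast."""
    mae_out = np.mean(np.abs(np.asarray(actual_out) - np.asarray(forecast_out)))
    mae_naive_in = np.mean(np.abs(np.diff(np.asarray(insample))))
    return mae_out / mae_naive_in

# Made-up example: an I(1) series with drift, forecast (poorly) by the in-sample mean.
# Its out-of-sample errors dwarf the naive one-step in-sample errors, so MASE >> 1.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=144) + 0.2) + 50
insample, actual_out = y[:120], y[120:]
forecast_out = np.full(24, insample.mean())
print(round(mase(insample, actual_out, forecast_out), 2))
```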

Question:

$MASE=1.38$ was used as a benchmark in a forecasting competition proposed in this Hyndsight blog post. Shouldn't an obvious benchmark have been $MASE=1$?

Of course, this question is not specific to the particular forecasting competition. I would like some help on understanding this in a more general context.

My guess:

The only sensible explanation I can see is that a naive forecast was expected to do considerably worse out of sample than it did in sample, e.g. due to a structural change. In that case, $MASE<1$ might have been too challenging a target.

References:

Hyndman, R. J. & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.

Best Answer

In the linked blog post, Rob Hyndman calls for entries to a tourism forecasting competition. Essentially, the blog post serves to draw attention to the relevant IJF article, an ungated version of which is linked to in the blog post.

The benchmarks you refer to - 1.38 for monthly, 1.43 for quarterly and 2.28 for yearly data - were apparently arrived at as follows. The authors (all of them expert forecasters and very active in the IIF - no snake oil salesmen here) are quite capable of applying standard forecasting algorithms or forecasting software, and they are presumably not interested in a simple ARIMA submission. So they applied some standard methods to their data. For the winning submission to be invited for a paper in the IJF, they ask that it improve on the best of these standard methods, as measured by MASE.

So your question essentially boils down to:

Given that a MASE of 1 corresponds to a forecast that is as good out of sample (in terms of MAE) as the naive random-walk forecast was in sample, why can't standard forecasting methods like ARIMA improve on 1.38 for monthly data?

Here, the 1.38 MASE comes from Table 4 in the ungated version. It is the MASE averaged over 1- to 24-month-ahead ARIMA forecasts. The other standard methods, such as ForecastPro, ETS etc., perform even worse.

And here, the answer gets hard. It is always very problematic to judge forecast accuracy without considering the data. One possibility I could think of in this particular case is accelerating trends. Suppose that you try to forecast $\exp(t)$ with standard methods. None of these will capture the accelerating trend (and this is usually a Good Thing - if your forecasting algorithm routinely models an accelerating trend, you will likely far overshoot your mark), and they will yield a MASE above 1. Other explanations could, as you say, be structural breaks, e.g. level shifts or external influences like SARS or 9/11, which would not be captured by the non-causal benchmark models but could be modelled by dedicated tourism forecasting methods (although using future causals in a holdout sample is a kind of cheating).
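To illustrate the accelerating-trend point with a toy example of my own (a linear drift extrapolation stands in for the "standard methods"):

```python
import numpy as np

# An exponential trend forecast by a linear drift continuation.
t = np.arange(1, 145)
y = np.exp(0.05 * t)                 # accelerating trend
train, test = y[:120], y[120:]       # hold out the last 24 points

drift = np.mean(np.diff(train))      # average in-sample one-step change
forecast = train[-1] + drift * np.arange(1, len(test) + 1)

mae_out = np.mean(np.abs(test - forecast))
mae_naive_in = np.mean(np.abs(np.diff(train)))  # in-sample no-change (naive) errors
print(mae_out / mae_naive_in)        # MASE far above 1: the linear continuation undershoots badly
```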

So I'd say that you likely can't say a lot about this without looking at the data themselves. They are available on Kaggle. Your best bet is likely to take these 518 series, hold out the last 24 months, fit ARIMA models, calculate MASEs, dig out the ten or twenty series with the worst MASEs, get a big pot of coffee, look at these series and try to figure out what it is that makes ARIMA models so bad at forecasting them.
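A rough sketch of that workflow in Python, under a couple of assumptions of mine: the series are loaded into a dict of pandas Series (the actual Kaggle files will need their own loading code), and a fixed seasonal ARIMA order stands in for proper automatic order selection:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def mase(insample, actual, forecast):
    return np.mean(np.abs(actual - forecast)) / np.mean(np.abs(np.diff(insample)))

# Placeholder data -- replace with the 518 monthly tourism series from Kaggle.
rng = np.random.default_rng(1)
series_dict = {f"m{i}": pd.Series(np.cumsum(rng.normal(size=144)) + 100) for i in range(5)}

results = {}
for name, y in series_dict.items():
    y = y.to_numpy()
    train, test = y[:-24], y[-24:]                    # hold out the last 24 months
    # Fixed (1,1,1)(0,1,1)[12] order as a stand-in; in practice use automatic
    # order selection (e.g. auto.arima in R or pmdarima in Python).
    fit = ARIMA(train, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12)).fit()
    results[name] = mase(train, test, fit.forecast(steps=24))

worst = pd.Series(results).sort_values(ascending=False).head(20)
print(worst)   # the series to stare at over that pot of coffee
```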

EDIT: another point that appears obvious after the fact but took me five days to see - remember that the denominator of the MASE is the MAE of the one-step-ahead in-sample random walk forecast, whereas the numerator is the MAE of the 1- to 24-step-ahead out-of-sample forecasts. It's not too surprising that forecasts deteriorate with increasing horizon, so this may be another reason for a MASE of 1.38. Note that the Seasonal Naive forecast was also included among the benchmarks and had an even higher MASE.
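A quick simulation of my own (not from the post) of this horizon effect: even the no-change forecast itself, applied to a pure random walk, scores well above 1 when its 1-24-step out-of-sample MAE is scaled by its one-step in-sample MAE:

```python
import numpy as np

# Even with the "correct" model, multi-step errors accumulate on a random walk,
# so the naive forecast's own 1-24-step MASE sits well above 1.
rng = np.random.default_rng(42)
mases = []
for _ in range(1000):
    y = np.cumsum(rng.normal(size=144))
    train, test = y[:120], y[120:]
    naive_fc = np.full(24, train[-1])                 # no-change forecast
    mae_out = np.mean(np.abs(test - naive_fc))
    mae_naive_in = np.mean(np.abs(np.diff(train)))    # one-step in-sample errors
    mases.append(mae_out / mae_naive_in)
print(np.mean(mases))   # roughly 3.5 in this setup: a pure horizon effect
```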
