My task is to forecast the stock required for a retail store over the next month, on a daily basis. How do I decide whether
MAPE, SMAPE and MASE
are good metrics for this scenario?
In my context, over-forecast is better than under-forecast.
MASE compares the forecasts to those obtained from a naive method. The naive method turns out to be very poor for white noise, but not so bad for an AR(1) with $\phi=0.7$. Consequently, the forecasts for the AR have a worse MASE than the forecasts for the white noise.
We can make this more precise as follows.
Let $y_1,y_2,\dots,y_{T}$ be a non-seasonal time series process observed to time $T$. Then MASE is defined as $$ \text{MASE} = \frac{1}{K}\sum_{k=1}^K |y_{T+k} - \hat{y}_{T+k|T}| / Q $$ where $Q$ is a scaling factor equal to the in-sample one-step naive forecast error, $$ Q = \frac{1}{T-1} \sum_{t=2}^T |y_t-y_{t-1}|, $$ and $\hat{y}_{T+k|T}$ is an estimate of $y_{T+k}$ given the observations $y_1,\dots,y_T$.
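In code, the definition above could be sketched like this (a minimal illustration assuming `numpy`; the function name is mine):

```python
import numpy as np

def mase(actuals, forecasts, insample):
    """Mean Absolute Scaled Error for a non-seasonal series.

    actuals   : out-of-sample observations y_{T+1}, ..., y_{T+K}
    forecasts : forecasts made at time T for the same horizons
    insample  : historical observations y_1, ..., y_T
    """
    # Q: mean absolute in-sample one-step naive forecast error
    q = np.mean(np.abs(np.diff(insample)))
    return np.mean(np.abs(np.asarray(actuals) - np.asarray(forecasts))) / q
```

For example, with `insample=[1, 2, 3, 4]` the naive error $Q$ is 1, so `mase([5, 6], [5, 5], [1, 2, 3, 4])` is 0.5.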
MASE provides a measure of how accurate forecasts are for a given series and the $Q$ scaling is intended to allow comparisons between series of different scales.
Suppose $y_t$ is standard Gaussian white noise $N(0,1)$. Then the data has variance 1, and the optimal forecast is $\hat{y}_{T+k|T}=0$ with forecast variance $v_{T+k|T} = 1$. Therefore $\text{E}|y_{T+k} - \hat{y}_{T+k|T}| = \sqrt{2/\pi}$ and $y_t-y_{t-1}\sim N(0,2)$. Thus the scaling factor has mean $\text{E}(Q) = 2/\sqrt{\pi}$, so that MASE has asymptotic mean $1/\sqrt{2}\approx 0.707$ (as $T\rightarrow\infty$). Note also that the long-term forecast variance $v_{T+\infty|T}=1$ is less than the in-sample naive forecast variance of 2.
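A quick simulation sketch (assuming `numpy`) reproduces this limit of roughly 0.707:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, reps = 10_000, 10, 200
ratios = []
for _ in range(reps):
    y = rng.standard_normal(T + K)
    insample, future = y[:T], y[T:]
    q = np.mean(np.abs(np.diff(insample)))   # E(Q) -> 2/sqrt(pi) ~ 1.128
    forecast = np.zeros(K)                   # optimal forecast is 0
    ratios.append(np.mean(np.abs(future - forecast)) / q)

print(np.mean(ratios))   # close to 1/sqrt(2) ~ 0.707
```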
But suppose $y_t$ is an AR(1) process defined as $y_t = \phi y_{t-1} + e_t$ where $e_t$ is Gaussian white noise $N(0,\sigma^2)$. Then the data has variance $\sigma^2/(1-\phi^2)$, and optimal forecast is $\hat{y}_{T+k|T} = \phi^k y_{T}$ with variance $v_{T+k|T} = \sigma^2(1-\phi^{2k})/(1-\phi^2)$. Therefore $\text{E}|y_{T+k} - \hat{y}_{T+k|T}| = \sigma\sqrt{2(1-\phi^{2k})/[(1-\phi^2)\pi]}$ and $y_t-y_{t-1} \sim N(0, 2\sigma^2/(1+\phi))$. Thus the scaling factor has mean $\text{E}(Q) = 2\sigma/\sqrt{\pi(1+\phi)}$.
For large $k$, if $\sigma^2 = 1-\phi^2$ then $v_{T+k|T} \approx 1$, $\text{E}(Q) \approx 2\sqrt{(1-\phi)/\pi}$ and $\text{E}|y_{T+k} - \hat{y}_{T+k|T}| \approx \sqrt{2/\pi}$. So the asymptotic MASE (as $K\rightarrow\infty$ and $T\rightarrow\infty$) has mean of $$1 / \sqrt{2(1-\phi)}$$ which is approximately 1.29 for $\phi=0.7$.
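The same kind of simulation sketch for the AR(1) case (again assuming `numpy`; $\sigma$ is chosen as in the derivation above so the process has unit variance) lands near 1.29:

```python
import numpy as np

rng = np.random.default_rng(1)
phi = 0.7
sigma = np.sqrt(1 - phi ** 2)        # so that Var(y_t) = 1
T, K, reps = 5_000, 50, 100
ratios = []
for _ in range(reps):
    e = rng.normal(0.0, sigma, T + K)
    y = np.empty(T + K)
    y[0] = rng.standard_normal()     # draw from the stationary distribution
    for t in range(1, T + K):
        y[t] = phi * y[t - 1] + e[t]
    insample, future = y[:T], y[T:]
    q = np.mean(np.abs(np.diff(insample)))
    fc = phi ** np.arange(1, K + 1) * insample[-1]   # optimal forecast phi^k y_T
    ratios.append(np.mean(np.abs(future - fc)) / q)

print(np.mean(ratios))   # near 1/sqrt(2*(1 - phi)) ~ 1.29
```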
This is a good question. Unfortunately, while the academic forecasting literature is indeed (slowly) moving from an almost exclusive emphasis on point forecasts towards interval forecasts and predictive densities, there has been little work on evaluating interval forecasts. (EDIT: See the bottom of this answer for an update.)
As gung notes, whether or not a given 95% prediction interval contains the true actual is in principle a Bernoulli trial with $p=0.95$. Given enough PIs and realizations, you can in principle test the null hypothesis that the actual coverage probability is in fact at least 95%.
However, you will need to think about statistical power and sample sizes. It's probably best to decide beforehand what kind of deviation from the target coverage is still acceptable (is 92% OK? 90%?), then find the minimum sample size needed to detect a deviation that strong or stronger with a given power, say 80%, which is standard. You can do this by a straightforward simulation: simulate $n$ Bernoulli trials with $p=0.92$, estimate $\hat{p}$ and a confidence interval for it, see whether that interval contains the value 95%, repeat this many times, and tweak $n$ until 95% falls outside the CI in 80% of the simulations. Or use any Bernoulli power calculator.
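Such a power simulation might look like this (a rough sketch assuming `numpy` and a normal-approximation CI; the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(42)

def detection_power(n, p_true=0.92, p_target=0.95, reps=2_000):
    """Fraction of simulated batches in which a normal-approximation
    95% CI for the empirical coverage excludes the nominal target."""
    p_hat = rng.binomial(n, p_true, size=reps) / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
    return np.mean((p_target < lo) | (p_target > hi))

for n in (200, 500, 1000, 2000):
    print(n, detection_power(n))   # power grows with batch size n
```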
OK, now that we have our sample size, you can batch your PIs and realizations in batches of this sample size, see how often your PIs contain the true realization, and start testing. Your batches can be the last $n$ PI/realizations of a single time series, or all the latest PIs/realizations of a large number of time series you are forecasting, or whatever.
This approach has the advantage of being rather easy to explain and to understand. Of course, if you have a large number of trials, even small deviations from the target coverage will be statistically significant, which is why you'll need to think about what deviation actually is significant from a business perspective, as per above.
Alternatively, quantile forecasts (say, a 2.5% and a 97.5% quantile forecast, to yield a 95% PI) arise naturally as optimal point forecasts under certain loss functions, which are parameterized based on the target quantile. This paper gives a nice overview. This may be an alternative to the Bernoulli tests above: find the correct loss function for your target upper and lower quantile, then evaluate the two endpoints of your PIs under these loss functions. However, the loss functions are rather abstract and not easily understood, especially for nontechnical audiences.
If you are comparing, say, multiple forecasting methods, you could first discard those whose PIs significantly underperform, based on Bernoulli hypothesis tests or loss functions, then assess the ones that passed this initial screening based on the width of their PIs. Among two PIs with the same correct coverage rate, the narrower one is usually better.
For a simple evaluation of PIs using null hypothesis significance tests, see this paper. There are also some far more elaborate schemes for evaluating PIs, which can also deal with serial dependence in deviations in coverage (maybe your financial PIs are good some part of the year, but bad at specific times), like this paper and that paper. Unfortunately, these require quite a large number of PI/realizations and so are likely only relevant for high-frequency financial data, like stock prices reported multiple times per day.
Finally, there has recently been some interest in going beyond PIs to the underlying predictive densities, which can be evaluated using (proper) scoring rules. Tilmann Gneiting has been very active in this area, and this paper of his gives a good introduction. However, even if you do decide to go deeper into predictive densities, scoring rules are again quite abstract and hard to communicate to a nontechnical audience.
EDIT - an update:
Your quality measure needs to balance coverage and length of the prediction intervals: yes, we want high coverage, but we also want short intervals.
There is a quality measure that does precisely this and has attractive properties: the interval score. Let $\ell$ and $u$ be the lower and the upper end of the prediction interval. The score is given by
$$ S(\ell,u,h) = (u-\ell)+\frac{2}{\alpha}(\ell-h)1(h<\ell)+\frac{2}{\alpha}(h-u)1(h>u). $$
Here $1$ is the indicator function, and $\alpha$ is the miscoverage level, i.e., one minus the nominal coverage of the interval (so $\alpha=0.05$ for a 95% PI). You will need to prespecify this, based on what you plan on doing with the prediction interval. It makes no sense to aim for 100% coverage, because the resulting intervals will be too wide to be useful for anything.
You can then average the interval score over many predictions. The lower the average score, the better. See Gneiting & Raftery (2007, JASA) for a discussion and pointers to further literature. A scaled version of this score was used, for instance, in assessing prediction intervals in the recent M4 forecasting competition.
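As an illustration, the interval score could be computed as follows (a sketch assuming `numpy`, with `alpha` taken as the miscoverage level, i.e., 0.05 for a 95% interval, following Gneiting & Raftery):

```python
import numpy as np

def interval_score(lower, upper, actual, alpha=0.05):
    """Interval score of Gneiting & Raftery (2007) for a central
    (1 - alpha) x 100% prediction interval; lower scores are better."""
    lower, upper, actual = map(np.asarray, (lower, upper, actual))
    width = upper - lower
    below = (2.0 / alpha) * (lower - actual) * (actual < lower)
    above = (2.0 / alpha) * (actual - upper) * (actual > upper)
    return width + below + above

print(interval_score(0.0, 10.0, 5.0))    # hit: score is just the width, 10
print(interval_score(0.0, 10.0, 12.0))   # miss by 2: 10 + (2/0.05)*2 = 90
```

Note how a miss is penalized in proportion to how far outside the interval the realization falls, so the score trades off width against coverage.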
(Full disclosure: this was shamelessly cribbed from this answer of mine.)
Best Answer
You are forecasting for stock control, so you need to think about setting safety amounts. In my opinion, a quantile forecast is far more important in this situation than a forecast of some central tendency (which the accuracy KPIs you mention assess).
You essentially have two or three possibilities.
Directly forecast high quantiles of your unknown future distribution. There are more and more papers on this. I'll attach some below.
Regarding your question, you can assess the quality of quantile forecasts using hinge loss functions, which are also used in quantile regression. Take a look at the papers by Ehm et al. (2016) and Gneiting (2011) below.
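For illustration, this quantile (pinball) loss could be sketched as follows (assuming `numpy`; the function name is mine):

```python
import numpy as np

def pinball_loss(actual, forecast, tau):
    """Pinball (quantile) loss: under-forecasts are weighted tau,
    over-forecasts (1 - tau); minimized in expectation by the
    true tau-quantile."""
    diff = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

# With tau = 0.9, under-forecasting is nine times as costly as
# over-forecasting, matching a setting where over-forecasts are preferred:
print(pinball_loss([3, 0, 7, 2], 5, tau=0.9))   # approximately 0.7
```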
Forecast some central tendency, e.g., the conditional expectation, plus higher moments as necessary, and combine these with an appropriate distributional assumption to obtain quantiles or safety amounts. For instance, you could forecast the conditional mean and the conditional variance and use a normal or negative-binomial distribution to set target service levels.
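As a toy illustration of this second approach under a normal assumption (hypothetical numbers; Python standard library only):

```python
from statistics import NormalDist

# Hypothetical daily forecasts: conditional mean and variance of demand
mean_demand, var_demand = 100.0, 400.0
service_level = 0.95          # target cycle service level

order_up_to = NormalDist(mean_demand, var_demand ** 0.5).inv_cdf(service_level)
safety_stock = order_up_to - mean_demand
print(order_up_to, safety_stock)   # about 132.9 and 32.9
```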
In this case, you can use a forecast accuracy KPI that is consistent with the measure of central tendency you are forecasting for. For instance, if you try to forecast the conditional expectation, you can assess it using the MSE. Or you could forecast the conditional median and assess this using the MAE, wMAPE or MASE. See Kolassa (2020) on why this sounds so complicated. And you will still need to assess whether your forecasts of higher moments (e.g., the variance) are correct. Probably best to directly evaluate the quantiles this approach yields by the methods discussed above.
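A small simulation illustrates why the KPI must match the functional you forecast (a sketch assuming `numpy`): on skewed count data, the MSE-optimal flat forecast is the mean, while the MAE-optimal one is the median.

```python
import numpy as np

rng = np.random.default_rng(0)
demand = rng.poisson(2, size=10_000)     # skewed count demand, as in retail

grid = np.linspace(0.0, 5.0, 501)
mse = np.array([np.mean((demand - f) ** 2) for f in grid])
mae = np.array([np.mean(np.abs(demand - f)) for f in grid])

print(grid[np.argmin(mse)], demand.mean())      # MSE rewards the mean
print(grid[np.argmin(mae)], np.median(demand))  # MAE rewards the median
```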
Forecast full predictive densities, from which you can derive all quantiles you need. This is what I argue for in Kolassa (2016).
You can evaluate predictive densities using proper scoring rules. See Kolassa (2016) for details and pointers to literature. The problem is that these are far less intuitive than the point forecast error measures discussed above.
The thread "What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?" is likely helpful and contains more information. If you are forecasting for a single store, I suspect that the MAPE will often be undefined, because of zero demands (which you would need to divide by).
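A tiny example of the zero-demand problem (assuming `numpy`; the numbers are made up):

```python
import numpy as np

demand   = np.array([0, 3, 0, 5])    # daily demand with zero-demand days
forecast = np.array([1, 3, 1, 4])

with np.errstate(divide="ignore"):
    ape = np.abs(demand - forecast) / demand
print(ape)    # inf on the zero-demand days, so the MAPE is undefined
```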
References
(sorry for not nicely formatting these)
Ehm, W.; Gneiting, T.; Jordan, A. & Krüger, F. Of quantiles and expectiles: consistent scoring functions, Choquet representations and forecast rankings (with discussion). Journal of the Royal Statistical Society, Series B, 2016 , 78 , 505-562
Gneiting, T. Quantiles as optimal point forecasts. International Journal of Forecasting, 2011 , 27 , 197-207
Kolassa, S. Why the "best" point forecast depends on the error or accuracy measure. International Journal of Forecasting, 2020 , 36, 208-211
Kolassa, S. Evaluating Predictive Count Data Distributions in Retail Sales Forecasting. International Journal of Forecasting, 2016 , 32 , 788-803
The following are more generally on quantile forecasting:
Trapero, J. R.; Cardós, M. & Kourentzes, N. Quantile forecast optimal combination to enhance safety stock estimation. International Journal of Forecasting, 2019 , 35 , 239-250
Bruzda, J. Quantile smoothing in supply chain and logistics forecasting. International Journal of Production Economics, 2019 , 208 , 122 - 139
Kourentzes, N.; Trapero, J. R. & Barrow, D. K. Optimising forecasting models for inventory planning. Lancaster University Management School, Lancaster University Management School, 2019
Ulrich, M.; Jahnke, H.; Langrock, R.; Pesch, R. & Senge, R. Distributional regression for demand forecasting -- a case study. 2018
Bruzda, J. Multistep quantile forecasts for supply chain and logistics operations: bootstrapping, the GARCH model and quantile regression based approaches. Central European Journal of Operations Research, 2018