Solved – What are the shortcomings of the Mean Absolute Percentage Error (MAPE)

accuracymaemapemasemse

The Mean Absolute Percentage Error (mape) is a common accuracy or error measure for time series or other predictions,

$$ \text{MAPE} = \frac{100}{n}\sum_{t=1}^n\frac{|A_t-F_t|}{A_t}\%,$$

where $A_t$ are actuals and $F_t$ corresponding forecasts or predictions.

The MAPE is a percentage, so we can easily compare it between series, and people can easily understand and interpret percentages.

However, I hear that the MAPE has drawbacks. I'd like to understand these drawbacks better so I can make an informed decision about whether to use the MAPE or some alternative like the MSE (mse), the MAE (mae) or the MASE (mase).

Best Answer

Shortcomings of the MAPE

The MAPE, as a percentage, only makes sense for values where divisions and ratios make sense. It doesn't make sense to calculate percentages of temperatures, for instance, so you shouldn't use the MAPE to calculate the accuracy of a temperature forecast.
If just a single actual is zero, $A_t=0$, then you divide by zero in calculating the MAPE, which is undefined.

It turns out that some forecasting software nevertheless reports a MAPE for such series, simply by dropping periods with zero actuals (Hoover, 2006). Needless to say, this is not a good idea, as it implies that we don't care at all about what we forecasted if the actual was zero - but a forecast of $F_t=100$ and one of $F_t=1000$ may have very different implications. So check what your software does.

If only a few zeros occur, you can use a weighted MAPE (Kolassa & Schütz, 2007), which nevertheless has problems of its own. This also applies to the symmetric MAPE (Goodwin & Lawton, 1999).
MAPEs greater than 100% can occur. If you prefer to work with accuracy, which some people define as 100%-MAPE, then this may lead to negative accuracy, which people may have a hard time understanding. (No, truncating accuracy at zero is not a good idea.)
Model fitting relies on minimizing errors, which is often done using numerical optimizers that use first or second derivatives. The MAPE is not everywhere differentiable, and its Hessian is zero wherever it is defined. This can throw optimizers off if we want to use the MAPE as an in-sample fit criterion.

A possible mitigation may be to use the log cosh loss function, which is similar to the MAE but twice differentiable. Alternatively, Zheng (2011) offer a way to approximate the MAE (or any other quantile loss) to arbitrary precision using a smooth function. If we know bounds on the actuals (which we do when fitting strictly positive historical data), we can therefore smoothly approximate the MAPE to arbitrary precision.
If we have strictly positive data we wish to forecast (and per above, the MAPE doesn't make sense otherwise), then we won't ever forecast below zero. Now, the MAPE treats overforecasts differently than underforecasts: an underforecast will never contribute more than 100% (e.g., if $F_t=0$ and $A_t=1$), but the contribution of an overforecast is unbounded (e.g., if $F_t=5$ and $A_t=1$). This means that the MAPE may be lower for biased than for unbiased forecasts. Minimizing it may lead to forecasts that are biased low.

Especially the last bullet point merits a little more thought. For this, we need to take a step back.

To start with, note that we don't know the future outcome perfectly, nor will we ever. So the future outcome follows a probability distribution. Our so-called point forecast $F_t$ is our attempt to summarize what we know about the future distribution (i.e., the predictive distribution) at time $t$ using a single number. The MAPE then is a quality measure of a whole sequence of such single-number-summaries of future distributions at times $t=1, \dots, n$.

The problem here is that people rarely explicitly say what a good one-number-summary of a future distribution is.

When you talk to forecast consumers, they will usually want $F_t$ to be correct "on average". That is, they want $F_t$ to be the expectation or the mean of the future distribution, rather than, say, its median.

Here's the problem: minimizing the MAPE will typically not incentivize us to output this expectation, but a quite different one-number-summary (McKenzie, 2011, Kolassa, 2020). This happens for two different reasons.

Asymmetric future distributions. Suppose our true future distribution follows a stationary $(\mu=1,\sigma^2=1)$ lognormal distribution. The following picture shows a simulated time series, as well as the corresponding density.

The horizontal lines give the optimal point forecasts, where "optimality" is defined as minimizing the expected error for various error measures.

The dashed line at $F_t=\exp(\mu+\frac{\sigma^2}{2})\approx 4.5$ minimizes the expected MSE. It is the expectation of the time series.
The dotted line at $F_t=\exp\mu\approx 2.7$ minimizes the expected MAE. It is the median of the time series.
The dash-dotted line at $F_t=\exp(\mu-\sigma^2)=1.0$ minimizes the expected MAPE. It is the (-1)-median of the time series (Gneiting, 2011, p. 752 with $\beta=-1$), which in the specific case of a lognormal distribution happens to coincide with the mode of the distribution.

We see that the asymmetry of the future distribution, together with the fact that the MAPE differentially penalizes over- and underforecasts, implies that minimizing the MAPE will lead to heavily biased forecasts. (Here is the calculation of optimal point forecasts in the gamma case.)

Symmetric distribution with a high coefficient of variation. Suppose that $A_t$ comes from rolling a standard six-sided die at each time point $t$. The picture below again shows a simulated sample path:

In this case:

The dashed line at $F_t=3.5$ minimizes the expected MSE. It is the expectation of the time series.
Any forecast $3\leq F_t\leq 4$ (not shown in the graph) will minimize the expected MAE. All values in this interval are medians of the time series.
The dash-dotted line at $F_t=2$ minimizes the expected MAPE.

We again see how minimizing the MAPE can lead to a biased forecast, because of the differential penalty it applies to over- and underforecasts. In this case, the problem does not come from an asymmetric distribution, but from the high coefficient of variation of our data-generating process.

This is actually a simple illustration you can use to teach people about the shortcomings of the MAPE - just hand your attendees a few dice and have them roll. See Kolassa & Martin (2011) for more information.

R code

Lognormal example:

mm <- 1
ss.sq <- 1
SAPMediumGray <- "#999999"; SAPGold <- "#F0AB00"

set.seed(2013)
actuals <- rlnorm(100,meanlog=mm,sdlog=sqrt(ss.sq))

opar <- par(mar=c(3,2,0,0)+.1)
    plot(actuals,type="o",pch=21,cex=0.8,bg="black",xlab="",ylab="",xlim=c(0,150))
    abline(v=101,col=SAPMediumGray)

    xx <- seq(0,max(actuals),by=.1)
    polygon(c(101+150*dlnorm(xx,meanlog=mm,sdlog=sqrt(ss.sq)),
      rep(101,length(xx))),c(xx,rev(xx)),col="lightgray",border=NA)

    (min.Ese <- exp(mm+ss.sq/2))
    lines(c(101,150),rep(min.Ese,2),col=SAPGold,lwd=3,lty=2)
    
    (min.Eae <- exp(mm))
    lines(c(101,150),rep(min.Eae,2),col=SAPGold,lwd=3,lty=3)
    
    (min.Eape <- exp(mm-ss.sq))
    lines(c(101,150),rep(min.Eape,2),col=SAPGold,lwd=3,lty=4)
par(opar)

Dice rolling example:

SAPMediumGray <- "#999999"; SAPGold <- "#F0AB00"

set.seed(2013)
actuals <- sample(x=1:6,size=100,replace=TRUE)

opar <- par(mar=c(3,2,0,0)+.1)
    plot(actuals,type="o",pch=21,cex=0.8,bg="black",xlab="",ylab="",xlim=c(0,150))
    abline(v=101,col=SAPMediumGray)

    min.Ese <- 3.5
    lines(c(101,150),rep(min.Ese,2),col=SAPGold,lwd=3,lty=2)
    
    min.Eape <- 2
    lines(c(101,150),rep(min.Eape,2),col=SAPGold,lwd=3,lty=4)
par(opar)

References

Gneiting, T. Making and Evaluating Point Forecasts. Journal of the American Statistical Association, 2011, 106, 746-762

Goodwin, P. & Lawton, R. On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 1999, 15, 405-408

Hoover, J. Measuring Forecast Accuracy: Omissions in Today's Forecasting Engines and Demand-Planning Software. Foresight: The International Journal of Applied Forecasting, 2006, 4, 32-35

Kolassa, S. Why the "best" point forecast depends on the error or accuracy measure (Invited commentary on the M4 forecasting competition). International Journal of Forecasting, 2020, 36(1), 208-211

Kolassa, S. & Martin, R. Percentage Errors Can Ruin Your Day (and Rolling the Dice Shows How). Foresight: The International Journal of Applied Forecasting, 2011, 23, 21-29

Kolassa, S. & Schütz, W. Advantages of the MAD/Mean ratio over the MAPE. Foresight: The International Journal of Applied Forecasting, 2007, 6, 40-43

McKenzie, J. Mean absolute percentage error and bias in economic forecasting. Economics Letters, 2011, 113, 259-262

Zheng, S. Gradient descent algorithms for quantile regression with smooth approximation. International Journal of Machine Learning and Cybernetics, 2011, 2, 191-207

Solved – What are the shortcomings of the Mean Absolute Percentage Error (MAPE)

Best Answer

Shortcomings of the MAPE

Related CrossValidated questions

R code

References

Related Question

Best Answer

Shortcomings of the MAPE

Related CrossValidated questions

R code

References

Related Solutions

Solved – Which forecasting method should be selected in case of contradictory results from different accuracy measures

Time Series – Interpretation of Mean Absolute Scaled Error (MASE)

Related Question