I recently built a model designed for prediction. Initially, I chose model A over model B: better RMSE and better MAPE. However, after carefully evaluating each prediction on my test dataset for the two models, I concluded that prediction accuracy is consistently higher for model B on most test observations in terms of those two statistics, except for the last few outliers, which blurred the single-number statistics. Excluding the 10 worst observations from the RMSE/MAPE calculation led me to choose B over A in the end.
I applied a solution that required looking at each observation and comparing fit errors in the tail of the error distribution. A simpler solution could be to calculate the statistic on the best 90-95% of fits. Are there any other, better solutions, more grounded in statistical theory?
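For concreteness, the "calculate the statistic on the best 90-95% of fits" idea can be sketched as follows (the function name and `keep` parameter are illustrative, not a standard API):

```python
import numpy as np

def trimmed_rmse(errors, keep=0.9):
    """RMSE over the `keep` fraction of observations with the smallest
    absolute errors (illustrative helper, not a standard function)."""
    errors = np.asarray(errors, dtype=float)
    k = int(np.ceil(keep * errors.size))
    # keep only the k smallest absolute errors, discarding the worst fits
    smallest = np.sort(np.abs(errors))[:k]
    return np.sqrt(np.mean(smallest**2))
```

For example, with nine errors of 1 and one error of 100, `trimmed_rmse(..., keep=0.9)` drops the single outlier and returns 1.0, while the ordinary RMSE would be dominated by it.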
In case you ask (because I asked myself) why I would want to be blind to the observations on which I make the greatest errors, the answer is: the dependent variable for those observations was probably flawed (wrongly calculated), and my prediction is closer to the truth than the original value was. But I could only reach such a conclusion after fitting the model.
Best Answer
There is a relationship between RMSE (root mean square error) and MAE (mean absolute error) that could help you choose between these.
MAE ≤ RMSE ≤ √n · MAE, where the most extreme difference occurs when all of the error is concentrated in a single observation and the remaining errors are zero. Thus RMSE can increase with the number of observations even if the underlying stochastic process is unchanged; this does not happen for MAE.
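A quick numerical check of the two extremes (the error values and sample size here are arbitrary):

```python
import numpy as np

n = 100
# Worst case: all of the error in one observation, the rest zero
errors = np.zeros(n)
errors[0] = 5.0

mae = np.mean(np.abs(errors))        # 5/n
rmse = np.sqrt(np.mean(errors**2))   # 5/sqrt(n)
assert np.isclose(rmse, np.sqrt(n) * mae)  # upper bound attained

# Best case: identical errors everywhere, so MAE == RMSE
uniform = np.full(n, 5.0)
assert np.isclose(np.sqrt(np.mean(uniform**2)), np.mean(np.abs(uniform)))
```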
When the errors are normally distributed, this effect is very small, but for fat-tailed error distributions it can be problematic, especially since it makes it difficult to compare samples with different numbers of observations.
This is well explained in this paper by Willmott & Matsuura.
Also it's quite easy to simulate this effect in R:
This simulates errors that are t-distributed with 3 degrees of freedom (to get a reasonably fat tail). It runs each calculation 10,000 times and calculates the average. When this is done for the sample sizes 30, 100, 1000 and 10000, we get the following result:
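The R listing itself is not reproduced above, but a Python sketch of the simulation just described would look like this (1,000 repetitions instead of 10,000, to run quickly; the seed and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
sample_sizes = [30, 100, 1000, 10000]
reps = 1000  # the text uses 10,000 repetitions; reduced here for speed

avg_rmse, avg_mae = {}, {}
for n in sample_sizes:
    # draw reps samples of n errors each from a fat-tailed t(3) distribution
    errors = rng.standard_t(df=3, size=(reps, n))
    avg_rmse[n] = np.sqrt(np.mean(errors**2, axis=1)).mean()
    avg_mae[n] = np.mean(np.abs(errors), axis=1).mean()
    print(f"n={n:6d}  avg RMSE={avg_rmse[n]:.3f}  avg MAE={avg_mae[n]:.3f}")
```

The average RMSE should creep upward as n grows, while the average MAE stays essentially flat; swapping `rng.standard_t(df=3, ...)` for `rng.standard_normal(...)` makes the RMSE drift all but vanish.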
The results show a clear increase in RMSE as the number of observations increases, but this is not the case for MAE. If one replaces the t-distribution in the code with a normal distribution, one can see that this effect all but disappears.
Based on this result, and also because the MAE result is easier to understand intuitively, I would go for MAE.
Hope this is of help. Regards, Morten Bunes Gustavsen