Solved – Best statistic for measuring prediction accuracy that is robust to outliers

descriptive statistics, error, mse, prediction

I have recently built a model designed for prediction. Initially, I chose model A over model B because it had a better RMSE and a better MAPE. However, after carefully evaluating each prediction on my test dataset for the two models, I concluded that prediction accuracy is consistently higher for model B on most of the test observations, except for a few outliers, which distorted the single-number statistics. Excluding the 10 worst observations from the RMSE/MAPE calculation led me to choose B over A in the end.

The solution I applied required looking at each observation and comparing the fit errors in the tail of the error distribution. A simpler solution would be to calculate the statistic on the best-fitting 90-95% of observations, as sketched below. Are there other, better solutions that are more grounded in statistical theory?
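For concreteness, the "statistic on the best 90-95% of fits" idea can be written as a pair of helper functions (a minimal sketch; the function names, the 5% trim default and the placeholder vectors in the usage comment are my own):

# RMSE and MAPE computed after dropping the worst `trim` share of absolute errors
trimmedRMSE <- function(actual, predicted, trim = 0.05) {
    err  <- actual - predicted
    keep <- abs(err) <= quantile(abs(err), probs = 1 - trim)
    sqrt(mean(err[keep]^2))
}

trimmedMAPE <- function(actual, predicted, trim = 0.05) {
    err  <- actual - predicted
    keep <- abs(err) <= quantile(abs(err), probs = 1 - trim)
    100 * mean(abs(err[keep] / actual[keep]))
}

# e.g. compare models on the best-fitting 95% of the test set:
# trimmedRMSE(y_test, predA); trimmedRMSE(y_test, predB)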

In case you ask (because I asked myself) why I would want to be blind to the observations on which I make the greatest errors: the dependent variable for those observations was probably flawed (wrongly calculated), and my prediction is closer to the truth than the recorded value was. But I could only reach that conclusion after fitting the model.

Best Answer

There is a relationship between RMSE (root mean square error) and MAE (mean absolute error) that could help you choose between these.

MAE ≤ RMSE ≤ sqrt(n)·MAE, where the most extreme difference occurs when all of the error is concentrated in a single observation and the remaining errors are zero. Thus RMSE can increase with the number of observations even if the underlying stochastic process is unchanged: with fat-tailed errors, larger samples are more likely to contain a few very large errors, and RMSE weights those quadratically. This does not happen for MAE.
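The upper bound is easy to verify with a toy error vector in R (all of the error in one of n = 10 observations):

e <- c(10, rep(0, 9))            # one large error, nine perfect predictions

mae  <- mean(abs(e))             # 10 / 10 = 1
rmse <- sqrt(mean(e^2))          # sqrt(100 / 10) = sqrt(10) ≈ 3.16

all.equal(rmse, sqrt(length(e)) * mae)   # TRUE: RMSE sits exactly at sqrt(n) * MAE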

When the errors are normally distributed, this effect is very small, but for errors with fatter tails it can be problematic, especially since it makes it difficult to compare samples with different numbers of observations.

This is well explained in the paper by Willmott & Matsuura (2005) on the advantages of MAE over RMSE.

It's also quite easy to simulate this effect in R:

numObs  <- 30
numReps <- 10000

sumRMSE <- 0
sumMAE  <- 0

for (i in 1:numReps) {
    # t-distributed errors with 3 degrees of freedom give a reasonably fat tail
    testError <- rt(numObs, df = 3)

    sumRMSE <- sumRMSE + sqrt(mean(testError^2))
    sumMAE  <- sumMAE + mean(abs(testError))
}

avgRMSE <- sumRMSE / numReps
cat("RMSE numObs:", numObs, avgRMSE, "\n")

avgMAE <- sumMAE / numReps
cat("MAE  numObs:", numObs, avgMAE, "\n")

In both versions, the errors are t-distributed with 3 degrees of freedom (to get a reasonably fat tail), and each calculation is repeated 10,000 times and averaged. Running this for sample sizes of 30, 100, 1,000 and 10,000 gives the following result:

numObs    avg RMSE    avg MAE
30        1.614496    1.106086
100       1.655523    1.101510
1000      1.702508    1.102015
10000     1.725051    1.102870

The results show a clear increase in RMSE as the number of observations increases, but this is not the case for MAE. If one replaces the t-distribution in the code with a normal distribution, one can see that this effect all but disappears.

Based on this result, and also because it's easier to have an intuitive understanding of the MAE result, I would go with MAE.
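To tie this back to the original question, here is a deliberately artificial comparison (the error vectors are made up purely for illustration): model B fits 95 of 100 observations much better than model A but makes five large errors, e.g. on flawed target values.

errA <- rep(1, 100)                          # model A: uniformly mediocre
errB <- c(rep(0.3, 95), 8, 9, 10, 11, 12)    # model B: better on 95%, five outliers

c(RMSE_A = sqrt(mean(errA^2)), RMSE_B = sqrt(mean(errB^2)))   # 1.00 vs ~2.28: RMSE picks A
c(MAE_A  = mean(abs(errA)),    MAE_B  = mean(abs(errB)))      # 1.00 vs ~0.79: MAE picks B

A handful of large errors dominates RMSE because of the squaring, while MAE still reflects which model fits the bulk of the observations better.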

Hope this is of help. Regards, Morten Bunes Gustavsen
