Solved – Interpretation of (scale of) AIC, AICc and BIC when comparing different models

Tags: aic, model comparison, time series

I'm trying to fit a model to a time series, but I'm pretty confused about which one is best.

I'm looking at an arima model, an ets model and an stlf model, each of which performed best within its own family of models. When I compare rolling forecast errors for 6-month forecasts, they perform equally well: each model has the smallest errors exactly one third of the time.

I then tried looking at other criteria such as AIC, AICc and BIC, and got the results below. My problem is really the scale of the information criteria: they are roughly a factor of a hundred smaller for the stlf model. Is it really that much better, or is something else at play here?

#The arima model:
Series: myts
ARIMA(0,1,0)(1,0,0)[12]
Coefficients:
        sar1
      0.8394
s.e.  0.0704
sigma^2 estimated as 19456:  log likelihood=-229.81
AIC=463.61   AICc=463.99   BIC=466.72

#The ets model:
ETS(M,N,M)
Call:
 ets(y = myts)
  Smoothing parameters:
    alpha = 0.5505
    gamma = 1e-04
  Initial states:
    l = 500.5273
    s=0.5977 0.3134 0.298 0.5218 1.6367 2.0899
           2.1506 2.2123 0.8724 0.5279 0.4086 0.3708

  sigma:  0.1507

     AIC     AICc      BIC
438.9330 458.9330 461.1023

#The stlf model:
 ETS(A,N,N)
Call:
 ets(y = x, model = etsmodel)
  Smoothing parameters:
    alpha = 0.483
  Initial states:
    l = 6.0707
  sigma:  0.1587
      AIC      AICc       BIC
0.4533825 0.8170189 3.6204204 
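For reference, the output above comes from the forecast package in R. A minimal sketch of how fits like these are typically produced (the function names are the standard forecast-package ones, not necessarily my exact calls; myts is my monthly series):

library(forecast)

fit_arima <- auto.arima(myts)   # ARIMA family
fit_ets   <- ets(myts)          # exponential smoothing family
fc_stlf   <- stlf(myts, h = 6)  # STL decomposition + ETS on the seasonally adjusted series

summary(fit_arima)
summary(fit_ets)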

Can they be compared at all? I seem to remember that these criteria can only be compared between different models under certain conditions.

Best Answer

You can't compare information criteria between different fitting methods. The likelihood behind AIC and friends contains additive constants that different fitting routines set differently, and stlf fits its ETS model to the seasonally adjusted series rather than to the original data, so its likelihood is not even on the same scale. You can only compare AICs for different models fitted by the same method to the same data. So no help there.
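To make the valid case concrete, here is a minimal sketch (hypothetical ARIMA orders, using the forecast package's Arima() function and the aicc component it stores): information criteria are comparable within one model class, fitted by one function, to the same series.

library(forecast)

# Valid comparison: same fitting function, same series, same likelihood definition.
fit1 <- Arima(myts, order = c(0, 1, 0), seasonal = c(1, 0, 0))
fit2 <- Arima(myts, order = c(1, 1, 0), seasonal = c(1, 0, 0))
c(fit1 = fit1$aicc, fit2 = fit2$aicc)   # smaller AICc is better

# Not valid: putting these numbers next to the AICc of an ets() fit, or of the
# ETS model that stlf() fits to the seasonally adjusted series, because those
# likelihoods are defined on different data and scales.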

Looking at rolling out-of-sample forecasts was exactly the right thing to do. Now you know that each model is best one third of the time. You could also look at the magnitude of the errors (MAD or MSE): perhaps one model's errors are sometimes very small and sometimes very large.
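A sketch of such a rolling-origin comparison with the forecast package's tsCV() (the wrapper functions are hypothetical; the initial argument is there so the seasonal models always have enough training data):

library(forecast)

f_arima <- function(y, h) forecast(auto.arima(y), h = h)
f_ets   <- function(y, h) forecast(ets(y), h = h)
f_stlf  <- function(y, h) stlf(y, h = h)

e_arima <- tsCV(myts, f_arima, h = 6, initial = 24)
e_ets   <- tsCV(myts, f_ets,   h = 6, initial = 24)
e_stlf  <- tsCV(myts, f_stlf,  h = 6, initial = 24)

# Error magnitude per forecast horizon, not just the win/lose count:
colMeans(abs(e_arima), na.rm = TRUE)   # MAD by horizon
colMeans(e_arima^2, na.rm = TRUE)      # MSE by horizon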

Failing that, it may well be that all three methods are equally good.

One smart trick to improve forecast accuracy: calculate forecasts from all three methods and average them within each future time bucket. Averaging forecasts, particularly from very different methods, almost always improves accuracy and also reduces error variance.
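A minimal sketch of that combination in R (equal weights; the fits are re-specified here so the snippet is self-contained):

library(forecast)

h <- 6
fc_arima <- forecast(auto.arima(myts), h = h)
fc_ets   <- forecast(ets(myts), h = h)
fc_stlf  <- stlf(myts, h = h)

# Simple equal-weight average of the three point forecasts per future month.
fc_combined <- (fc_arima$mean + fc_ets$mean + fc_stlf$mean) / 3
fc_combined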
