Solved – When is the AIC a good model selection criterion for forecasting and when is it not

Tags: aic, bic, forecasting, time-series

I'm trying to wrap my head around why the AIC and other similar information criteria (ICs) work as proxies for out-of-sample error when performing automated forecast generation.

So I performed an experiment on the AirPassengers data set, which is about as forecastable as a real-world time series gets. I used the data through the end of 1958 as the training set and the data from 1959 and 1960 as the holdout set.

The results I got were surprising:

Using auto.arima() (with stepwise=FALSE and approximation=FALSE), the best-fitting model selected was an $ARIMA(0,1,3)(0,1,0)_{12}$ model, with AIC = 802.346 and a test RMSE = 69.235753.

I then fit an "over the top" ARIMA model with extreme orders, $ARIMA(15,1,15)(4,1,4)_{12}$. As expected, this led to a higher AIC of 829.2997, but with a much lower RMSE = 27.871115.

Then I tried a more reasonable ARIMA model (i.e., one within the range that auto.arima() would have considered), $ARIMA(2,1,2)(2,1,2)_{12}$, and it gave an AIC = 804.56 and an RMSE = 42.051491.

From the plots of the forecasts, it is even clearer that the model with the highest AIC gives the best forecast. The AICc and the BIC showed the same reversed behavior, with the model with the highest IC value giving the lowest RMSE and MAPE.

My questions:

  • Isn't this the exact opposite of the intended behavior? I thought minimizing the AIC (or any other IC) gives the best model, not maximizing it.
  • Is there something wrong with my experiment that's giving me this counterintuitive result?
  • Does the fact that the Air Passengers time series is very regular and forecastable have something to do with this? Would the AIC work better for very noisy data?
  • When are the AIC and other ICs appropriate for time series model selection if not in this case?

[Plot: forecasts from the three ARIMA models against the 1959–1960 test data]

# Load the necessary libraries
library(ggplot2)
library(forecast)
library(zoo)
library(scales)
theme_set(theme_bw())

# Load the AirPassengers data
data("AirPassengers")

# Split the data into train and test sets
train <- window(AirPassengers, end = c(1958, 12))
test <- window(AirPassengers, start = c(1959, 1), end = c(1960,12))


# Fit the models
fit <- auto.arima(train, stepwise = FALSE, approximation = FALSE)
fit2 <- Arima(train, order = c(15, 1, 15), seasonal = list(order = c(4, 1, 4), period = 12), method = "ML")
fit3 <- Arima(train, order = c(2, 1, 2), seasonal = list(order = c(2, 1, 2), period = 12), method = "ML")

# Generate forecasts
# Forecast intervals are set to 0 so that they are not displayed, for better clarity
arima_fct <- forecast(fit, level = c(0, 0), h = 24)
arima_fct2 <- forecast(fit2, level = c(0, 0), h = 24)
arima_fct3 <- forecast(fit3, level = c(0, 0), h = 24)

# Compare in-sample AICs
fit$aic
fit2$aic
fit3$aic
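
# The question also compares AICc and BIC; Arima fits store these as well
fit$aicc; fit2$aicc; fit3$aicc
fit$bic; fit2$bic; fit3$bic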

# Compare out-of-sample accuracy on the holdout set
accuracy(arima_fct, test)
accuracy(arima_fct2, test)
accuracy(arima_fct3, test)


# Plot results
autoplot(arima_fct, ylab = 'Passengers') +
  scale_x_yearmon() +
  autolayer(test, series = "Test Data") +
  autolayer(arima_fct$mean, series = "ARIMA(0,1,3)(0,1,0)[12]: AIC = 802.3461, Test RMSE = 69.235753") +
  autolayer(arima_fct3$mean, series = "ARIMA(2,1,2)(2,1,2)[12]: AIC = 804.56, Test RMSE = 42.051491") +
  autolayer(arima_fct2$mean, series = "ARIMA(15,1,15)(4,1,4)[12]: AIC = 829.2997, Test RMSE = 27.871115")

Best Answer

I am not completely satisfied with my answer, but here goes.

  1. To an extent, you are comparing apples and oranges. Your two calls to Arima() use method="ML", whereas your auto.arima() call uses the default, which is method="CSS-ML". Then again, refitting everything with the same method does not make a real difference here; see the sketch below.
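
    For an apples-to-apples check, you can refit the auto.arima() winner with the same estimation method as your other two calls. A minimal sketch (fit_ml is just an illustrative name):

    # Refit the selected model with method = "ML" so that all three
    # models are estimated the same way, then compare AICs as before
    fit_ml <- Arima(train, order = c(0, 1, 3), seasonal = c(0, 1, 0), method = "ML")
    fit_ml$aic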

  2. Minimizing the AIC is asymptotically equivalent to minimizing the one-step-ahead squared prediction error. (I don't have a reference at hand, sorry.) Note that this is an asymptotic result in a suitable statistical sense: it is quite possible for a handpicked model to outperform the AIC-selected one on a time series of limited length. And on a single series, at that. You can probe this correspondence empirically with time-series cross-validation, as in the sketch below.
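
    A minimal sketch of that check, using tsCV() from the forecast package (the wrapper name farima and the initial = 60 burn-in are illustrative choices, not anything canonical). It refits a fixed ARIMA specification at each forecast origin and collects the one-step-ahead errors:

    # Rolling-origin estimate of the one-step-ahead RMSE for a fixed
    # ARIMA specification; origins with too little data are skipped
    farima <- function(x, h) {
      forecast(Arima(x, order = c(0, 1, 3), seasonal = c(0, 1, 0)), h = h)
    }
    e <- tsCV(train, farima, h = 1, initial = 60)  # one-step-ahead CV errors
    sqrt(mean(e^2, na.rm = TRUE))                  # one-step-ahead RMSE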

  3. Finally, as you write in a comment, the AirPassengers dataset exhibits strong multiplicative seasonality. ARIMA does not model multiplicative seasonality or trend; it can only deal with additive effects. Your overparameterized model happens to capture the multiplicative trend and seasonality here, but it would just as readily "find" such effects in a series that does not exhibit them. There are good reasons why such large models are typically not considered.

    To model multiplicative effects, allow auto.arima() to use Box-Cox transformations:

    > (foo <- auto.arima(train,lambda="auto"))
    Series: train 
    ARIMA(0,1,1)(0,1,1)[12] 
    Box Cox transformation: lambda= -0.3096628 
    
    Coefficients:
              ma1     sma1
          -0.3936  -0.5713
    s.e.   0.1035   0.0863
    
    > accuracy(forecast(foo,h=24,biasadj=TRUE),test)
                         ME      RMSE       MAE        MPE     MAPE      MASE       ACF1 Theil's U
    Training set -0.7186038  8.915531  6.691014 -0.2079082 2.753580 0.2341638 0.04889565        NA
    Test set     28.5600533 31.711896 28.884516  6.2710488 6.348486 1.0108644 0.17279165 0.6372069
    

    I cut out the AIC, because it is not comparable to the AIC of models fit to untransformed data. Note that we end up much closer to your large model in terms of test RMSE, but this model is far more interpretable, and I personally would trust it a lot more than an ARIMA(15,1,15)(4,1,4)[12] one. Incidentally, searching through more possible ARIMA models yields the exact same model:

    > (bar <- auto.arima(train,max.p=15,max.q=15,max.P=4,max.Q=4,
    + lambda="auto",stepwise=FALSE,approximation=FALSE))
    Series: train 
    ARIMA(0,1,1)(0,1,1)[12] 
    Box Cox transformation: lambda= -0.3096628 
    
    Coefficients:
              ma1     sma1
          -0.3936  -0.5713
    s.e.   0.1035   0.0863
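
    Incidentally, ARIMA(0,1,1)(0,1,1)[12] on a transformed scale is the classic Box-Jenkins "airline model". If you want something even more interpretable than the estimated Box-Cox lambda, a plain log transform (lambda = 0) is the natural special case to try. A minimal sketch (baz is just an illustrative name, and its accuracy will of course differ somewhat from the fit above):

    # Airline-model structure on the log scale, an interpretable
    # special case of the Box-Cox family
    baz <- Arima(train, order = c(0, 1, 1), seasonal = c(0, 1, 1), lambda = 0)
    accuracy(forecast(baz, h = 24, biasadj = TRUE), test)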