Solved – Are models identified by auto.arima() parsimonious?

Tags: aic, arima, automatic-algorithms, forecasting, time-series

I have been trying to learn and apply ARIMA models. I have been reading an excellent text on ARIMA by Pankratz – Forecasting with Univariate Box-Jenkins Models: Concepts and Cases. In the text the author especially emphasizes the principle of parsimony in choosing ARIMA models.

I started playing with the auto.arima() function in the R package forecast. Here is what I did: I simulated an ARIMA series and then applied auto.arima() to it. Below are two examples. As you can see, in both examples auto.arima() identified a model that many would consider non-parsimonious. This is especially true in example 2, where auto.arima() identified an ARIMA(3,0,3) when an ARIMA(1,0,1) would have been sufficient and parsimonious.

Below are my questions. I would appreciate any suggestions and recommendations.

  1. Is there any guidance on when to use (or modify) the models identified by automatic algorithms such as auto.arima()?
  2. Are there any pitfalls in relying solely on the AIC (which is what I think auto.arima() uses) to identify models?
  3. Can an automatic algorithm be built that favors parsimony?

By the way, I used auto.arima() just as an example; the question applies to any automatic algorithm.
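As a side note on question 2, candidate orders can also be compared by hand. This is a hedged sketch, not the questioner's code: it assumes the forecast package (whose Arima() and AICc() functions are used), and keep in mind that information criteria are only comparable across models fitted to the same, identically differenced data.

```r
library(forecast)

set.seed(182)
y <- arima.sim(n = 500, list(ar = 0.2, ma = 0.6), mean = 10)

# Fit a few candidate orders and compare their AICc values;
# lower is better, but small differences carry little weight.
orders <- list(c(1, 0, 1), c(0, 0, 1), c(1, 0, 2))
fits <- lapply(orders, function(ord) Arima(y, order = ord))
sapply(fits, AICc)
```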

Below is Example #1:

library(forecast)  # provides auto.arima()

set.seed(182)
y <- arima.sim(n = 500, list(ar = 0.2, ma = 0.6), mean = 10)

auto.arima(y)

qa <- arima(y,order=c(1,0,1))
qa

Below are the results from auto.arima(). Note that all the ARMA coefficients are insignificant, i.e., each $|$estimate/s.e.$|$ ($t$ value) is less than 2.

ARIMA(1,0,2) with non-zero mean 

Coefficients:
         ar1     ma1      ma2  intercept
      0.5395  0.2109  -0.3385    19.9850
s.e.  0.4062  0.4160   0.3049     0.0878

sigma^2 estimated as 1.076:  log likelihood=-728.14
AIC=1466.28   AICc=1466.41   BIC=1487.36

Below are the results from running regular arima() with order ARIMA(1,0,1):

Series: y 
ARIMA(1,0,1) with non-zero mean 

Coefficients:
         ar1     ma1  intercept
      0.2398  0.6478    20.0323
s.e.  0.0531  0.0376     0.1002

sigma^2 estimated as 1.071:  log likelihood=-727.1
AIC=1462.2   AICc=1462.28   BIC=1479.06

Example 2:

set.seed(453)
y <- arima.sim(n=500,list(ar=0.2,ma=0.6),mean = 10)

auto.arima(y)

qa <- arima(y,order=c(1,0,1))
qa

Below are the results from auto.arima():

ARIMA(3,0,3) with non-zero mean 

Coefficients:
         ar1      ar2     ar3     ma1     ma2     ma3  intercept
      0.7541  -1.0606  0.2072  0.1391  0.5912  0.5491    20.0326
s.e.  0.0811   0.0666  0.0647  0.0725  0.0598  0.0636     0.0939

sigma^2 estimated as 1.027:  log likelihood=-716.84
AIC=1449.67   AICc=1449.97   BIC=1483.39

Below are the results from running regular arima() with order ARIMA(1,0,1):

Series: y 
ARIMA(1,0,1) with non-zero mean 

Coefficients:
         ar1     ma1  intercept
      0.2398  0.6478    20.0323
s.e.  0.0531  0.0376     0.1002

sigma^2 estimated as 1.071:  log likelihood=-727.1
AIC=1462.2   AICc=1462.28   BIC=1479.06

Best Answer

There are a couple of issues here. First, don't presume that a simulated series truly follows a model of the order you specified: you are drawing a finite sample from the specified model, and due to sampling variability the best-fitting model for the particular sample drawn may not have the same order as the generating model.

I mention this because of the second and more important issue: the auto.arima() function can estimate models via a faster fitting algorithm, based on conditional sums of squares, to avoid excessive computation time for long series or complex seasonal models. When this estimation method is in use, auto.arima() only approximates the information criteria for each model, because the exact log-likelihood has not been computed. If the user does not specify which approach to use, a simple heuristic decides whether conditional-sums-of-squares estimation is active.

The behaviour is controlled via the argument approximation, and the heuristic is (length(x)>100 | frequency(x)>12): approximation defaults to TRUE if the length of the series is greater than $n = 100$, or there are more than 12 observations per seasonal period. As you simulated series with $n = 500$ but did not set the approximation argument, you ran auto.arima() with approximation = TRUE. This explains the apparently erroneous selection of a model with larger AIC, AICc, and BIC than the simpler model you fitted with arima().
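The default described above can be checked directly. A minimal sketch, assuming the series from example 1 (note the exact heuristic may differ across versions of the forecast package):

```r
library(forecast)

set.seed(182)
y <- arima.sim(n = 500, list(ar = 0.2, ma = 0.6), mean = 10)

# Reproduce the default heuristic: approximation defaults to
# TRUE for long series or high seasonal frequency.
length(y) > 100 | frequency(y) > 12   # TRUE here, since n = 500
```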

For your example 1, with the approximation turned off we get

> auto.arima(y, approximation = FALSE)
Series: y 
ARIMA(0,0,1) with non-zero mean 

Coefficients:
         ma1  intercept
      0.7166    19.9844
s.e.  0.0301     0.0797

sigma^2 estimated as 1.079:  log likelihood=-728.94
AIC=1463.87   AICc=1463.92   BIC=1476.52
> qa
Series: y 
ARIMA(1,0,1) with non-zero mean 

Coefficients:
         ar1     ma1  intercept
      0.0565  0.6890    19.9846
s.e.  0.0626  0.0456     0.0830

sigma^2 estimated as 1.078:  log likelihood=-728.53
AIC=1465.06   AICc=1465.14   BIC=1481.92

Hence auto.arima() has selected a model even more parsimonious than the true one: an ARIMA(0,0,1). And the selection is now consistent with the information criteria; the chosen model has the lower AIC, AICc, and BIC, although the differences in AIC and AICc are small. At least the selection now follows the usual norms for choosing models by information criteria.

The reason the MA(1) is chosen relates, I believe, to the first issue I mentioned: the best-fitting model for a sample drawn from a stated ARIMA(p, d, q) need not have the same order as the true model, purely because of random sampling. Using a longer series or a longer burn-in period may increase the chance that the true order is selected, but don't bank on it.
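A sketch of both remedies, assuming the forecast package is loaded; the n.start argument of arima.sim() controls the burn-in length, and the values 5000 and 500 here are illustrative choices, not the questioner's:

```r
library(forecast)

set.seed(453)
# A longer series and a longer burn-in period give the
# selection procedure a better chance of recovering the
# true ARIMA(1,0,1) order -- though it is not guaranteed.
y_long <- arima.sim(n = 5000, list(ar = 0.2, ma = 0.6),
                    n.start = 500, mean = 10)
auto.arima(y_long, approximation = FALSE)
```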

Regardless, the moral here is: when something looks obviously wrong, as in your question, read the associated man page or documentation to make sure you understand how the software works.
