I have been trying to learn and apply ARIMA models. I have been reading an excellent text on ARIMA by Pankratz, Forecasting with Univariate Box-Jenkins Models: Concepts and Cases. In the text the author especially emphasizes the principle of parsimony in choosing ARIMA models.
I started playing with the auto.arima() function in the R package forecast. Here is what I did: I simulated ARIMA series and then applied auto.arima() to them. Below are two examples. As you can see, in both examples auto.arima() clearly identified a model that many would consider non-parsimonious, especially in example 2, where it identified an ARIMA(3,0,3) when an ARIMA(1,0,1) would have been sufficient and parsimonious.
Below are my questions. I would appreciate any suggestions and recommendations.
- Is there any guidance on when to use or modify the models identified by automatic algorithms such as auto.arima()?
- Are there any pitfalls in using only the AIC (which is what I think auto.arima() uses) to identify models?
- Can an automatic algorithm be built that favours parsimony?

By the way, I used auto.arima() just as an example; these questions apply to any automatic algorithm.
Below is Example #1:
set.seed(182)
y <- arima.sim(n=500,list(ar=0.2,ma=0.6),mean = 10)
auto.arima(y)
qa <- arima(y,order=c(1,0,1))
qa
Below are the results from auto.arima(). Please note that all of the ARMA coefficients are insignificant, i.e., each $t$ value (coefficient divided by its standard error) is below 2.
ARIMA(1,0,2) with non-zero mean
Coefficients:
ar1 ma1 ma2 intercept
0.5395 0.2109 -0.3385 19.9850
s.e. 0.4062 0.4160 0.3049 0.0878
sigma^2 estimated as 1.076: log likelihood=-728.14
AIC=1466.28 AICc=1466.41 BIC=1487.36
Below are the results from running regular arima() with order (1,0,1):
Series: y
ARIMA(1,0,1) with non-zero mean
Coefficients:
ar1 ma1 intercept
0.2398 0.6478 20.0323
s.e. 0.0531 0.0376 0.1002
sigma^2 estimated as 1.071: log likelihood=-727.1
AIC=1462.2 AICc=1462.28 BIC=1479.06
Example 2:
set.seed(453)
y <- arima.sim(n=500,list(ar=0.2,ma=0.6),mean = 10)
auto.arima(y)
qa <- arima(y,order=c(1,0,1))
qa
Below are the results from auto.arima():
ARIMA(3,0,3) with non-zero mean
Coefficients:
ar1 ar2 ar3 ma1 ma2 ma3 intercept
0.7541 -1.0606 0.2072 0.1391 0.5912 0.5491 20.0326
s.e. 0.0811 0.0666 0.0647 0.0725 0.0598 0.0636 0.0939
sigma^2 estimated as 1.027: log likelihood=-716.84
AIC=1449.67 AICc=1449.97 BIC=1483.39
Below are the results from running regular arima() with order (1,0,1):
Series: y
ARIMA(1,0,1) with non-zero mean
Coefficients:
ar1 ma1 intercept
0.2398 0.6478 20.0323
s.e. 0.0531 0.0376 0.1002
sigma^2 estimated as 1.071: log likelihood=-727.1
AIC=1462.2 AICc=1462.28 BIC=1479.06
Best Answer
There are a couple of issues here. Firstly, don't presume that a simulated ARIMA series truly follows the order you specify; you are drawing a sample from the specified model, and due to randomness the best-fitting model for that particular sample may not be of the order from which it was simulated.
I mention this because of the second and more important issue: the auto.arima() function can estimate models via a faster fitting algorithm, based on conditional sums of squares, to avoid excessive computation time for long series or for complex seasonal models. When that estimation method is in use, auto.arima() approximates the information criteria for each model (because the exact log likelihood has not been computed). If the user does not indicate which approach should be used, a simple heuristic decides whether the conditional-sums-of-squares estimation is active. The behaviour is controlled via the approximation argument, and the heuristic is (length(x) > 100 | frequency(x) > 12); hence approximation takes the value TRUE if the length of the series is greater than $n = 100$, or there are more than 12 observations within each year. As you simulated series with $n = 500$ but did not specify a value for the approximation argument, you ran auto.arima() with approximation = TRUE. This explains the apparently erroneous selection of a model with larger AIC, AICc, and BIC than the simpler model you fitted with arima().
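For what it's worth, both the approximation and the default stepwise search can be disabled explicitly, at the cost of longer computation. A sketch, assuming the forecast package is installed:

```r
library(forecast)

set.seed(182)
y <- arima.sim(n = 500, list(ar = 0.2, ma = 0.6), mean = 10)

# approximation = FALSE evaluates the exact ML information criteria for
# every candidate; stepwise = FALSE searches the order space exhaustively
# instead of using the default stepwise traversal.
fit <- auto.arima(y, approximation = FALSE, stepwise = FALSE)
fit
```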
For your example 1, rerunning with approximation = FALSE, auto.arima() in fact selects a more parsimonious model than the true one: an ARIMA(0, 0, 1). That selection is again based on the information criteria, but now they are in accordance: the chosen model has lower AIC, AICc, and BIC than your ARIMA(1, 0, 1), although the differences in AIC and AICc are small. At least the selection is now consistent with the usual norms for choosing models by information criteria.
The reason an MA(1) is chosen relates, I believe, to the first issue I mentioned: the best-fitting model for a sample drawn from a stated ARIMA(p, d, q) process may not have the same order as the true model, purely because of random sampling. Taking a longer series or a longer burn-in period may increase the chance that the true model is selected, but don't bank on it.
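The same comparison can be done by hand with base R's arima(), which maximises the exact likelihood by default, independently of auto.arima(); a sketch for example 1:

```r
# Fit the competing candidate orders for example 1 by maximum likelihood
# with base R's stats::arima, then compare the exact information criteria.
set.seed(182)
y <- arima.sim(n = 500, list(ar = 0.2, ma = 0.6), mean = 10)

orders <- list(c(0, 0, 1), c(1, 0, 1), c(1, 0, 2))
fits <- lapply(orders, function(o) arima(y, order = o))
names(fits) <- sapply(orders, function(o)
  sprintf("ARIMA(%d,%d,%d)", o[1], o[2], o[3]))

data.frame(AIC = sapply(fits, AIC), BIC = sapply(fits, BIC))
```

Whichever candidate wins here should broadly agree with what auto.arima() reports once the approximation is switched off, since both rest on full maximum-likelihood fits.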
Regardless, the moral here is that when something looks obviously wrong, as in your question, read the associated help page or documentation to assure yourself that you understand how the software works.