auto.arima is an automatic ARIMA modeling function in the forecast package in R that uses an information criterion (for example AIC or BIC) to select the best ARIMA model. I was attempting to answer a question on this site. I used the data series from that question, which can also be downloaded here: http://ge.tt/1uihVfA2/v/0?c. The plot is shown below.
Eyeballing the series, I find an outlier at 2012:03, or observation 101. So I used the tsoutliers package to create an additive outlier regressor, a binary-coded variable spanning the series (135 obs).
Below is the code for reproducibility.
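library(forecast)    # provides auto.arima
library(tsoutliers)  # provides outliers and outliers.effects
## 'data' holds the 135 monthly values read from the file linked above
datats <- ts(data,start=c(2003,11),frequency=12)
plot.ts(datats)
## Create data frame for outlier (pulse) to use in xreg in auto arima at obs 101
out.ind <- outliers(c("AO"), c(101))
out.df <- outliers.effects(out.ind, 135)
## Model using arima
ar.m <- auto.arima(datats,xreg=out.df)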
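For readers without tsoutliers installed, the regressor above can be thought of as a plain pulse dummy. The following base R sketch builds one by hand and checks that observation 101 maps to March 2012. The names n, pos, ao and idx are illustrative, and the assumption that this matrix matches what outliers.effects returns for an unweighted additive outlier is mine, not something verified against the package here.

```r
# Illustrative pulse dummy for an additive outlier (assumed equivalent
# to the unweighted AO regressor created with tsoutliers above)
n   <- 135   # series length
pos <- 101   # position of the suspected outlier

ao <- matrix(as.numeric(seq_len(n) == pos), ncol = 1,
             dimnames = list(NULL, paste0("AO", pos)))

# Map the observation index back to a calendar date
idx <- ts(seq_len(n), start = c(2003, 11), frequency = 12)
outlier.year  <- floor(time(idx)[pos])  # 2012
outlier.month <- cycle(idx)[pos]        # 3, i.e. March
```

A matrix like ao can be passed to xreg in the same way as out.df.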
When I apply auto.arima to the outlier-adjusted series using xreg, I get the following results.
Series: datats
ARIMA(0,0,0) with non-zero mean
Coefficients:
       intercept     AO101
      17253973.7  34441489
s.e.    842995.3  10137415
sigma^2 estimated as 9.503e+13: log likelihood=-2364.06
AIC=4734.12 AICc=4734.3 BIC=4742.84
I was really surprised by these results. The data are clearly seasonal and have structure, yet auto.arima chose "no model". This seems to be a poor model selection.
Below are my questions:
- Am I doing something incorrect? I'm assuming I did something egregiously incorrect here. Maybe someone can correct the flaw in my modeling process. If not, then I have two more questions.
- Is the auto.arima algorithm unable to handle a binary-coded variable for outlier correction?
- Is this a flaw or limitation in using AIC (information criteria in general) for ARIMA model selection for time series with external regressors?
Best Answer
The issue is most likely related to the scale of the data. The data take on large values, which may lead to numerical problems. This is what I get after dividing the data by 10,000 and using the BIC.
The chosen model contains two seasonal AR coefficients that capture cycles related to the fundamental seasonal frequency and some of its harmonics.
Note: I don't claim the above is the best model for the data; it was just intended to illustrate a possible problem with the scale of the data.
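The rescaling step might look like the sketch below. Because the original file is not reproduced here, it runs on a simulated stand-in series (seasonal pattern, large level, one additive outlier at observation 101); the simulation parameters and the names season, y, ao and fit.scaled are assumptions for illustration only, so the selected model need not match the one discussed above.

```r
library(forecast)

set.seed(1)
n <- 135

# Hypothetical stand-in for the original data: monthly seasonal pattern
# on a large scale, plus noise
season <- rep(c(5, 3, 0, -2, -4, -3, 0, 2, 4, 6, 3, 1), length.out = n)
y <- 1.7e7 + 2e6 * season + rnorm(n, sd = 1e6)
y[101] <- y[101] + 3.4e7   # additive outlier at observation 101
datats <- ts(y, start = c(2003, 11), frequency = 12)

# Pulse dummy for the outlier
ao <- matrix(as.numeric(seq_len(n) == 101), ncol = 1,
             dimnames = list(NULL, "AO101"))

# Rescale the response only; the 0/1 dummy is left as is
fit.scaled <- auto.arima(datats / 1e4, xreg = ao, ic = "bic")
fit.scaled
```

Dividing the series by a constant leaves the ARMA structure unchanged and only rescales the intercept, the outlier coefficient and the innovation variance, so forecasts can be multiplied by the same constant to return to the original units.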
Edit
Also be aware that for series with more than 100 observations, estimation is by default done by conditional sums of squares and the information criteria used for model selection are approximated (approximation=TRUE). This issue, along with the large scale of the data, may be troublesome in this case. The plot below shows the forecasts for each chosen model for the following cases, respectively: original data with approximation=TRUE (as in your example), original data with approximation=FALSE, and rescaled data with approximation=TRUE. Either rescaling the series or using approximation=FALSE yields a graphically sensible result compared to the original scale and the default options.
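A comparison along these lines could be set up as in the sketch below. It again uses a simulated stand-in series rather than the original data, so it may well not reproduce the poor selection seen on the original scale; the names fits and ao.future and the 24-month horizon are illustrative choices.

```r
library(forecast)

set.seed(1)
n <- 135

# Hypothetical stand-in series with a large level and an outlier at obs 101
season <- rep(c(5, 3, 0, -2, -4, -3, 0, 2, 4, 6, 3, 1), length.out = n)
y <- 1.7e7 + 2e6 * season + rnorm(n, sd = 1e6)
y[101] <- y[101] + 3.4e7
datats <- ts(y, start = c(2003, 11), frequency = 12)
ao <- matrix(as.numeric(seq_len(n) == 101), ncol = 1,
             dimnames = list(NULL, "AO101"))

# The three cases: approximation is set explicitly in each call
fits <- list(
  original.approx = auto.arima(datats, xreg = ao, approximation = TRUE),
  original.exact  = auto.arima(datats, xreg = ao, approximation = FALSE),
  rescaled.approx = auto.arima(datats / 1e4, xreg = ao, approximation = TRUE)
)

# Future values of the dummy are zero: no outlier is assumed ahead
ao.future <- matrix(0, nrow = 24, ncol = 1, dimnames = list(NULL, "AO101"))

# One forecast panel per case
op <- par(mfrow = c(3, 1))
for (nm in names(fits)) {
  plot(forecast(fits[[nm]], xreg = ao.future), main = nm)
}
par(op)
```

Setting approximation explicitly in each call avoids depending on the package default, and the forecast horizon is taken from the number of rows in ao.future.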