Solved – Issues in auto.arima algorithm when using external regressors and outlier correction

Tags: arima, forecasting, outliers, time series

auto.arima is an automatic ARIMA modeling function in the forecast package for R that uses an information criterion (e.g. AIC or BIC) to select the best ARIMA model. I was attempting to answer a question on this site. I used the data series from that question; it can also be downloaded here: http://ge.tt/1uihVfA2/v/0?c. The plot is shown below.

Eyeballing the series, I find an outlier at 2012:03, i.e. observation 101. So I used the tsoutliers package to create an additive outlier (AO), a binary-coded dummy variable for the series (135 observations).

[plot of the monthly series]
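For intuition, the additive-outlier regressor is just a pulse dummy: 0 everywhere and 1 at the outlier observation. A minimal hand-rolled equivalent of what outliers.effects() produces below (a sketch, not using tsoutliers):

ao101 <- rep(0, 135)   # pulse dummy for an additive outlier at observation 101;
ao101[101] <- 1        # this is what the AO101 column built by tsoutliers contains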

Below is the code for reproducibility.

library(forecast)
library(tsoutliers)

## `data` holds the downloaded series (135 monthly observations)
datats <- ts(data, start = c(2003, 11), frequency = 12)
plot.ts(datats)

## Create the regressor for an additive outlier (pulse) at observation 101,
## to be passed to auto.arima via xreg
out.ind <- outliers(c("AO"), c(101))
out.df <- outliers.effects(out.ind, 135)

## Fit the model with the outlier regressor
ar.m <- auto.arima(datats, xreg = out.df)

When I apply auto.arima to the series with the outlier dummy passed via xreg, I get the following results.

Series: datats 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
       intercept     AO101
      17253973.7  34441489
s.e.    842995.3  10137415

sigma^2 estimated as 9.503e+13:  log likelihood=-2364.06
AIC=4734.12   AICc=4734.3   BIC=4742.84

I was really surprised by these results: the data are clearly seasonal and structured, yet auto.arima chose "no model", an ARIMA(0,0,0) with non-zero mean, i.e. white noise around a constant. This seems to be a poor model selection.

Below are my questions:

  1. Am I doing something incorrect? I assume I did something egregiously wrong here; maybe someone can point out the flaw in my modeling process. If not, then I have two more questions.
  2. Is the auto.arima algorithm unable to handle a binary-coded variable for outlier correction?
  3. Is this a flaw/limitation of using AIC (information criteria in general) for ARIMA model selection on time series with external regressors?

Best Answer

The issue is most likely related to the scale of the data. The data take on large values, which may lead to numerical problems. This is what I get after dividing the data by 100,000 and using the BIC criterion.

require(forecast)
require(tsoutliers)
datats <- structure(c(2948440.74, 7686353.79, 6911493.29, 14735151.86, 23381406.69, 20227482.18, 18903360.7, 15833633.89, 13124710.08, 
  10021578.4, 10543615.37, 5101192.99, 4298189.2, 4003772.82, 5575776.76, 21282851.11, 22256756.35, 33951754.02, 25523787.6, 19104589.15, 
  16589269.34, 22852357.49, 16193397.83, 12323810.99, 10452747.36, 4987688.69, 8046769.04, 29435984.88, 24839254.81, 29105406.04, 
  31800705.19, 13817643.19, 13814943.1, 26481714.12, 17992971.92, 9399541.88, 6543970.31, 5310201.59, 7570275.84, 25988976.96, 
  30725584.28, 27390701.17, 25820703.75, 17954952.49, 15088122.36, 15702960.91, 13609359.16, 7369414.86, 4775041.56, 4034859.17, 
  17040576.77, 16826494.46, 22406996.07, 24773844.66, 31287138.55, 23570837.57, 16026064.28, 15197182.57, 15509137.94, 10241087.07, 
  3493348.32, 1730346.2, 17251774.11, 20148964.98, 19003322.88, 29760265.93, 25481326.36, 25830736.11, 14148186.89, 14368004.55, 
  14952315.95, 6886217.54, 7295155.27, 5944267.08, 14806881.21, 25824636.63, 26767998.82, 38040351.59, 29168699.84, 29433400.4, 
  20307430.63, 16010131.61, 12446879.82, 4818262.85, 3745592.95, 8864976.31, 12814904.34, 30240664.47, 34272201.49, 31058565.62, 
  18227566.3, 17786236.84, 21470722, 16643487.78, 11593014.72, 7215432.3, 3041258.29, 4808895.81, 13848873.64, 20426958.31, 
  51695463.16, 38293297.51, 25750299.14, 19245949.8, 14828108.86, 15736758.69, 11288184.22, 8509989.64, 6560242.4, 4625940.53, 
  22368905.29, 34677860.01, 38790083.19, 33567845.48, 39365091.23, 26318906.99, 20814308.33, 16424344.98, 12170515.08, 7406164.25, 
  4780528.38, 6320645.22, 13635055.52, 39512697.88, 41871133.48, 34342300.49, 27125796.06, 16253723.1, 20046817.98, 13785445.02, 
  14315921.47, 7451878.2, 5707949.58, 3884183.86, 11962756.46), .Tsp = c(2003.83333333333, 2015, 12), class = "ts")
## Additive-outlier dummy at observation 101, as in the question
out.ind <- outliers(c("AO"), c(101))
out.df <- outliers.effects(out.ind, 135)
## Rescale the series to avoid numerical problems
x <- datats/100000
auto.arima(x, ic = "bic", xreg = out.df)
# ARIMA(1,0,0)(2,0,0)[12] with non-zero mean 
# Coefficients:
#          ar1    sar1    sar2  intercept     AO101
#       0.3900  0.6898  0.1101   168.5335  179.8997
# s.e.  0.0925  0.0910  0.0995    25.2476   37.3177
# sigma^2 estimated as 2328:  log likelihood=-720.61
# AIC=1453.22   AICc=1453.88   BIC=1470.66

The chosen model contains two seasonal AR coefficients that capture cycles related to the fundamental seasonal frequency and some of its harmonics.
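A quick way to see this (a sketch using the seasonal AR coefficients printed above, not part of the original answer): the roots of the seasonal AR characteristic polynomial that lie closest to the unit circle imply cycles at the seasonal period of 12 months and its harmonics.

## Roots of 1 - 0.6898*B^12 - 0.1101*B^24, coefficients copied from the
## fit above; polyroot() takes coefficients in increasing powers of B
r <- polyroot(c(1, rep(0, 11), -0.6898, rep(0, 11), -0.1101))
cyc <- data.frame(modulus = Mod(r), period = 2 * pi / abs(Arg(r)))
## The 12 roots closest to the unit circle give one real (infinite-period)
## root plus cycles of about 12, 6, 4, 3, 2.4 and 2 months: the seasonal
## frequency and its harmonics
round(cyc[order(cyc$modulus), ][1:12, ], 2)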

Note: I don't claim the above is the best model for the data; it is just intended to illustrate a possible problem with the scale of the data.

Edit

Also be aware that, by default, for series with more than 100 observations, estimation is done by conditional sums of squares and the information criteria used for model selection are approximated (approximation=TRUE by default). This issue, along with the large scale of the data, may be troublesome in this case.
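One way to check whether the approximation itself (and not just the scale) misleads the search is to refit candidate specifications by exact maximum likelihood with Arima() and compare their exact information criteria. A sketch, using the ARIMA(0,0,0) specification from the question and the seasonal specification found above:

## Refit both specifications by exact ML on the original scale and compare
## their exact AICs (model orders taken from the outputs above)
fit.naive <- Arima(datats, order = c(0, 0, 0), xreg = out.df, method = "ML")
fit.seas  <- Arima(datats, order = c(1, 0, 0), seasonal = c(2, 0, 0),
                   xreg = out.df, method = "ML")
c(naive = fit.naive$aic, seasonal = fit.seas$aic)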

The plot below shows the forecasts from the model chosen in each of the following cases:

  1. the original scale and approximation=TRUE is used (as in your example),
  2. the original scale and approximation=FALSE,
  3. the original series divided by 10,000 and approximation=TRUE.

Either rescaling the series or using approximation=FALSE yields a graphically sensible result, in contrast to the original scale with the default options.

## Refit under the three scenarios listed above
fit1 <- auto.arima(datats, ic = "bic", xreg = out.df, approximation = TRUE)
fit2 <- auto.arima(datats, ic = "bic", xreg = out.df, approximation = FALSE)
fit3 <- auto.arima(datats/10000, ic = "bic", xreg = out.df, approximation = TRUE)
## Forecast 24 months ahead, setting the future outlier dummy to zero
## (recent versions of forecast may require xreg to be a matrix)
par(mfrow = c(3, 1), mar = c(2, 2, 2, 2))
plot(forecast(fit1, 24, xreg = rep(0, 24)))
plot(forecast(fit2, 24, xreg = rep(0, 24)))
plot(forecast(fit3, 24, xreg = rep(0, 24)))

[plot comparing the forecasts from the three models]