Solved – Issues in auto.arima algorithm when using external regressors and outlier correction

Tags: arima, forecasting, outliers, time series

auto.arima is an automatic ARIMA modeling function in the forecast package for R that uses an information criterion (e.g. AIC or BIC) to select the best ARIMA model. I was attempting to answer a question on this site. I used the data series from that question; it can also be downloaded here: http://ge.tt/1uihVfA2/v/0?c. The plot is shown below.

Eyeballing the series, I find an outlier at 2012:03, i.e. observation 101. So I used the tsoutliers package to create an additive outlier (AO), a binary-coded dummy variable for the series (135 observations).

[plot of the monthly series]
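For intuition, the additive-outlier regressor is just a pulse dummy: 0 everywhere and 1 at the outlier observation. A minimal hand-rolled equivalent of what outliers.effects() produces below (a sketch, not using tsoutliers):

ao101 <- rep(0, 135)   # pulse dummy for an additive outlier at observation 101;
ao101[101] <- 1        # this is what the AO101 column built by tsoutliers contains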

Below is the code for reproducibility.

library(forecast)
library(tsoutliers)

## `data` holds the downloaded series (135 monthly observations)
datats <- ts(data, start = c(2003, 11), frequency = 12)
plot.ts(datats)

## Create the regressor for an additive outlier (pulse) at observation 101,
## to be passed to auto.arima via xreg
out.ind <- outliers(c("AO"), c(101))
out.df <- outliers.effects(out.ind, 135)

## Fit the model with the outlier regressor
ar.m <- auto.arima(datats, xreg = out.df)

When I apply auto.arima to the series with the outlier dummy passed via xreg, I get the following results.

Series: datats 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
       intercept     AO101
      17253973.7  34441489
s.e.    842995.3  10137415

sigma^2 estimated as 9.503e+13:  log likelihood=-2364.06
AIC=4734.12   AICc=4734.3   BIC=4742.84

I was really surprised by these results: the data are clearly seasonal and structured, yet auto.arima chose "no model", an ARIMA(0,0,0) with non-zero mean, i.e. white noise around a constant. This seems to be a poor model selection.

Below are my questions:

  1. Am I doing something incorrect? I assume I did something egregiously wrong here; maybe someone can point out the flaw in my modeling process. If not, then I have two more questions.
  2. Is the auto.arima algorithm unable to handle a binary-coded variable for outlier correction?
  3. Is this a flaw/limitation of using AIC (information criteria in general) for ARIMA model selection on time series with external regressors?

Best Answer

The issue is most likely related to the scale of the data. The data take on large values, which may lead to numerical problems. This is what I get after dividing the data by 100,000 and using the BIC criterion.

require(forecast)
require(tsoutliers)
datats <- structure(c(2948440.74, 7686353.79, 6911493.29, 14735151.86, 23381406.69, 20227482.18, 18903360.7, 15833633.89, 13124710.08, 
  10021578.4, 10543615.37, 5101192.99, 4298189.2, 4003772.82, 5575776.76, 21282851.11, 22256756.35, 33951754.02, 25523787.6, 19104589.15, 
  16589269.34, 22852357.49, 16193397.83, 12323810.99, 10452747.36, 4987688.69, 8046769.04, 29435984.88, 24839254.81, 29105406.04, 
  31800705.19, 13817643.19, 13814943.1, 26481714.12, 17992971.92, 9399541.88, 6543970.31, 5310201.59, 7570275.84, 25988976.96, 
  30725584.28, 27390701.17, 25820703.75, 17954952.49, 15088122.36, 15702960.91, 13609359.16, 7369414.86, 4775041.56, 4034859.17, 
  17040576.77, 16826494.46, 22406996.07, 24773844.66, 31287138.55, 23570837.57, 16026064.28, 15197182.57, 15509137.94, 10241087.07, 
  3493348.32, 1730346.2, 17251774.11, 20148964.98, 19003322.88, 29760265.93, 25481326.36, 25830736.11, 14148186.89, 14368004.55, 
  14952315.95, 6886217.54, 7295155.27, 5944267.08, 14806881.21, 25824636.63, 26767998.82, 38040351.59, 29168699.84, 29433400.4, 
  20307430.63, 16010131.61, 12446879.82, 4818262.85, 3745592.95, 8864976.31, 12814904.34, 30240664.47, 34272201.49, 31058565.62, 
  18227566.3, 17786236.84, 21470722, 16643487.78, 11593014.72, 7215432.3, 3041258.29, 4808895.81, 13848873.64, 20426958.31, 
  51695463.16, 38293297.51, 25750299.14, 19245949.8, 14828108.86, 15736758.69, 11288184.22, 8509989.64, 6560242.4, 4625940.53, 
  22368905.29, 34677860.01, 38790083.19, 33567845.48, 39365091.23, 26318906.99, 20814308.33, 16424344.98, 12170515.08, 7406164.25, 
  4780528.38, 6320645.22, 13635055.52, 39512697.88, 41871133.48, 34342300.49, 27125796.06, 16253723.1, 20046817.98, 13785445.02, 
  14315921.47, 7451878.2, 5707949.58, 3884183.86, 11962756.46), .Tsp = c(2003.83333333333, 2015, 12), class = "ts")
## Additive-outlier dummy at observation 101, as in the question
out.ind <- outliers(c("AO"), c(101))
out.df <- outliers.effects(out.ind, 135)
## Rescale the series to avoid numerical problems
x <- datats/100000
auto.arima(x, ic = "bic", xreg = out.df)
# ARIMA(1,0,0)(2,0,0)[12] with non-zero mean 
# Coefficients:
#          ar1    sar1    sar2  intercept     AO101
#       0.3900  0.6898  0.1101   168.5335  179.8997
# s.e.  0.0925  0.0910  0.0995    25.2476   37.3177
# sigma^2 estimated as 2328:  log likelihood=-720.61
# AIC=1453.22   AICc=1453.88   BIC=1470.66

The chosen model contains two seasonal AR coefficients that capture cycles related to the fundamental seasonal frequency and some of its harmonics.
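A quick way to see this (a sketch using the seasonal AR coefficients printed above, not part of the original answer): the roots of the seasonal AR characteristic polynomial that lie closest to the unit circle imply cycles at the seasonal period of 12 months and its harmonics.

## Roots of 1 - 0.6898*B^12 - 0.1101*B^24, coefficients copied from the
## fit above; polyroot() takes coefficients in increasing powers of B
r <- polyroot(c(1, rep(0, 11), -0.6898, rep(0, 11), -0.1101))
cyc <- data.frame(modulus = Mod(r), period = 2 * pi / abs(Arg(r)))
## The 12 roots closest to the unit circle give one real (infinite-period)
## root plus cycles of about 12, 6, 4, 3, 2.4 and 2 months: the seasonal
## frequency and its harmonics
round(cyc[order(cyc$modulus), ][1:12, ], 2)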

Note: I don't claim the above is the best model for the data; it is just intended to illustrate a possible problem with the scale of the data.

Edit

Also be aware that, by default, for series with more than 100 observations, estimation is done by conditional sums of squares and the information criteria used for model selection are approximated (approximation=TRUE by default). This issue, along with the large scale of the data, may be troublesome in this case.
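One way to check whether the approximation itself (and not just the scale) misleads the search is to refit candidate specifications by exact maximum likelihood with Arima() and compare their exact information criteria. A sketch, using the ARIMA(0,0,0) specification from the question and the seasonal specification found above:

## Refit both specifications by exact ML on the original scale and compare
## their exact AICs (model orders taken from the outputs above)
fit.naive <- Arima(datats, order = c(0, 0, 0), xreg = out.df, method = "ML")
fit.seas  <- Arima(datats, order = c(1, 0, 0), seasonal = c(2, 0, 0),
                   xreg = out.df, method = "ML")
c(naive = fit.naive$aic, seasonal = fit.seas$aic)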

The plot below shows the forecasts from the model chosen in each of the following cases:

  1. the original scale and approximation=TRUE is used (as in your example),
  2. the original scale and approximation=FALSE,
  3. the original series divided by 10,000 and approximation=TRUE.

Either rescaling the series or using approximation=FALSE yields a graphically sensible result, in contrast to the original scale with the default options.

## Refit under the three scenarios listed above
fit1 <- auto.arima(datats, ic = "bic", xreg = out.df, approximation = TRUE)
fit2 <- auto.arima(datats, ic = "bic", xreg = out.df, approximation = FALSE)
fit3 <- auto.arima(datats/10000, ic = "bic", xreg = out.df, approximation = TRUE)
## Forecast 24 months ahead, setting the future outlier dummy to zero
## (recent versions of forecast may require xreg to be a matrix)
par(mfrow = c(3, 1), mar = c(2, 2, 2, 2))
plot(forecast(fit1, 24, xreg = rep(0, 24)))
plot(forecast(fit2, 24, xreg = rep(0, 24)))
plot(forecast(fit3, 24, xreg = rep(0, 24)))

[plot comparing the forecasts from the three models]