Solved – Unable to get suitable forecast for ARIMA model in R due to outliers – attached code for easy replication

arima, r, time series

Using the attached data, which was recently updated, I am unable to obtain a statistically significant forecast. The data is extremely seasonal and is stored here for easy replication:

http://ge.tt/1uihVfA2/v/0?c

# 1. Make a R timeseries out of the rawdata: specify frequency & startdate
gIIP <- ts(Data, frequency=12, start=c(2003,11))
print(gIIP)
plot.ts(gIIP, type="l", col="blue", ylab="MTD Ships", lwd=2,
        main="Full data")
grid()

Since I am using the auto.arima function, my understanding is that I don't need to apply a Box-Cox transformation separately, because auto.arima factors that into selecting the best model.
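A note on the Box-Cox point: as far as I know, auto.arima() leaves the series untransformed unless a lambda value is supplied, so the transformation is not part of its default search. A minimal sketch of passing one explicitly, using the built-in AirPassengers series as a stand-in for gIIP (an assumption, since the original data is not reproduced here):

```r
library(forecast)

y <- AirPassengers                  # stand-in monthly series (assumption)

# Estimate a Box-Cox parameter from the data, then hand it to auto.arima()
lam <- BoxCox.lambda(y)
fit <- auto.arima(y, lambda = lam)

print(lam)
print(fit)
```

Forecasts from a model fitted this way are back-transformed to the original scale by forecast().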

The best model it suggested was Arima(order = c(0, 0, 1), seasonal = list(order = c(1, 0, 1), period = 12)) with non-zero mean.

# 5. Perform estimation
library(forecast)
# Exhaustive (non-stepwise) search over seasonal ARIMA models.
# ic, test and seasonal.test each take a single value;
# lambda = NULL means no Box-Cox transformation is applied.
auto.arima(gIIP, d=NA, D=NA, max.p=12, max.q=12,
           max.P=2, max.Q=2, max.order=12, max.d=2, max.D=2,
           start.p=2, start.q=2, start.P=1, start.Q=1,
           stationary=FALSE, seasonal=TRUE,
           ic="aicc", stepwise=FALSE, trace=TRUE,
           approximation=FALSE, xreg=NULL,
           test="kpss", seasonal.test="ocsb",
           allowdrift=TRUE, lambda=NULL, parallel=FALSE)

I then proceed to run accuracy diagnostics, but I am unable to obtain any output.

# Check standard errors etc. of the "fitted" ARIMA
# (call Arima() to fit the model; "function(...)" only defines a function)
pos.arima <- Arima(gIIP, order = c(0, 0, 1),
      seasonal = list(order = c(1, 0, 1), period = 12),
      include.drift = TRUE,
      method = "CSS-ML", optim.method = "BFGS")
print(pos.arima)   # coefficients and their standard errors

# acf() and pacf() need a numeric series, so pass the model residuals
acf(residuals(pos.arima))
pacf(residuals(pos.arima))
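Alongside the ACF and PACF plots, a Ljung-Box test on the residuals gives a single number to look at. A minimal sketch, assuming the built-in AirPassengers series and an illustrative (0,1,1)(0,1,1) model rather than the model and data from the question:

```r
library(forecast)

y <- log(AirPassengers)             # stand-in series (assumption)
fit <- Arima(y, order = c(0, 1, 1), seasonal = c(0, 1, 1))
res <- residuals(fit)

# Ljung-Box test for remaining autocorrelation in the residuals;
# fitdf = number of estimated ARMA coefficients (2 for this model)
lb <- Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 2)
print(lb)
```

A small p-value would suggest the residuals are still autocorrelated, i.e. the model has left structure unexplained.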

The next step is an ex ante (out-of-sample) forecast, but here too I am unable to obtain a statistically significant forecast, that is, the forecast with the lowest standard error. I tested the model by removing the last 5 observations.

# 7. Forecast out-of-sample (this used to work)
# xreg must be NULL or a numeric matrix of regressors, so xreg = TRUE is dropped
fit <- Arima(gIIP, order = c(0, 0, 1),
             seasonal = list(order = c(1, 0, 1), period = 12),
             include.mean = TRUE, method = "CSS-ML",
             optim.method = "BFGS")
plot(forecast(fit,h=9))
print(forecast(fit,h=9))
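The "remove the last 5 observations" check can be sketched as follows. The series (built-in AirPassengers) and the model order are stand-ins chosen so the example runs on its own; they are not the data or model from the question.

```r
library(forecast)

y <- AirPassengers                            # stand-in for gIIP (assumption)
n <- length(y)
train <- window(y, end = time(y)[n - 5])      # everything except the last 5 points
test  <- window(y, start = time(y)[n - 4])    # the last 5 points, held back

# Illustrative model order; substitute the order chosen for your own data
fit <- Arima(train, order = c(0, 1, 1), seasonal = c(0, 1, 1))
fc  <- forecast(fit, h = length(test))

# Error measures (ME, RMSE, MAE, ...) on the training and the test set
acc <- accuracy(fc, test)
print(acc)
```

The "Test set" row is the one that says how the model performs out of sample.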

I used to obtain output here. Can you please help me diagnose why the ARIMA model is no longer working as it once did? Thank you for your time.

Best Answer

The basic problem with your data is that it contains outliers. Not treating them, or at least not understanding what and where they are, may lead to poor predictions.

There is an R package called tsoutliers, which implements the Chen and Liu procedure and can help you diagnose outliers in the data. Commercial packages such as Autobox, which uses Tsay's outlier detection, also have excellent outlier detection capabilities. I'll expand my answer in the next few days; bear with me.

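I used the tso function in the tsoutliers package to detect outliers:

library(tsoutliers)

datats <- ts(data,start=c(2003,11),frequency=12)
plot.ts(datats)

# Look for additive outliers (AO), level shifts (LS) and seasonal level shifts (SLS)
outliers <- tso(datats, types = c("AO", "LS","SLS"))
plot(outliers)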

Below are the outputs:

Outliers:
  type ind    time  coefhat tstat
1  SLS  18 2005:04 16923128 6.765
2   AO 101 2012:03 36590158 4.763
3  SLS 112 2013:02 21989974 4.096
4  SLS 113 2013:03 25225304 4.699
5   AO 115 2013:05 24259786 3.158

Looking at your data, there is an additive outlier at observation 2012:03 and a seasonal level shift around 2013:02. You can practically ignore the seasonal level shift at 2005:04. tsoutliers provides nice graphical output, which shows some instability in the last few years (2012/2013/2014): the seasonal variation has changed. If you do not account for it, then you are bound to produce poor forecasts.
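One possible way to account for detected outliers is to refit on the adjusted series that tso() returns (the yadj component). This is only a sketch on simulated data with one injected additive outlier, not a reproduction of the analysis above. Whether to adjust away a recent level shift before forecasting is a judgment call.

```r
library(forecast)
library(tsoutliers)

# Simulated monthly series with one injected additive outlier (illustrative data)
set.seed(1)
y <- ts(100 + as.numeric(arima.sim(list(ar = 0.5), n = 120)),
        frequency = 12, start = c(2005, 1))
y[60] <- y[60] + 15

res <- tso(y, types = c("AO", "LS"))
print(res$outliers)          # type, index, time, coefficient and t-statistic

# Refit and forecast on the outlier-adjusted series
fit_adj <- auto.arima(res$yadj)
fc <- forecast(fit_adj, h = 12)
plot(fc)
```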


Related Solutions

Solved – Daily Time Series Analysis

Your ACF and PACF indicate that you at least have weekly seasonality, which is shown by the peaks at lags 7, 14, 21 and so forth.

You may also have yearly seasonality, although it's not obvious from your time series.

Your best bet, given potentially multiple seasonalities, may be a tbats model, which explicitly models multiple types of seasonality. Load the forecast package:

library(forecast)

Your output from str(x) indicates that x does not yet carry information about potentially having multiple seasonalities. Look at ?tbats, and compare the output of str(taylor). Assign the seasonalities:

x.msts <- msts(x,seasonal.periods=c(7,365.25))

Now you can fit a tbats model. (Be patient, this may take a while.)

model <- tbats(x.msts)

Finally, you can forecast and plot:

plot(forecast(model,h=100))

You should not use arima() or auto.arima(), since these can only handle a single type of seasonality: either weekly or yearly. Don't ask me what auto.arima() would do on your data. It may pick one of the seasonalities, or it may disregard them altogether.

EDIT to answer additional questions from a comment:

How can I check whether the data has a yearly seasonality or not? Can I create another series of total number of events per month and use its ACF to decide this?

Calculating a model on monthly data might be a possibility. Then you could, e.g., compare AICs between models with and without seasonality.

However, I'd rather use a holdout sample to assess forecasting models. Hold out the last 100 data points. Fit a model with yearly and weekly seasonality to the rest of the data (like above), then fit one with only weekly seasonality, e.g., using auto.arima() on a ts with frequency=7. Forecast using both models into the holdout period. Check which one has a lower error, using MAE, MSE or whatever is most relevant to your loss function. If there is little difference between errors, go with the simpler model; otherwise, use the one with the lower error.

The proof of the pudding is in the eating, and the proof of the time series model is in the forecasting.

To improve matters, don't use a single holdout sample (which may be misleading, given the uptick at the end of your series), but use rolling origin forecasts, also known as "time series cross-validation". (I very much recommend that entire free online forecasting textbook.)
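Rolling origin evaluation can be sketched with tsCV() from the forecast package. The lynx series and the AR(2) model are illustrative choices only; plug in your own series and forecast function.

```r
library(forecast)

# Forecast function refitted at every rolling origin (illustrative AR(2) model)
far2 <- function(x, h) forecast(Arima(x, order = c(2, 0, 0)), h = h)

e_ar2   <- tsCV(lynx, far2, h = 1)     # one-step-ahead forecast errors
e_naive <- tsCV(lynx, naive, h = 1)    # benchmark: naive forecast

rmse <- function(e) sqrt(mean(e^2, na.rm = TRUE))
print(c(AR2 = rmse(e_ar2), naive = rmse(e_naive)))
```

Each error comes from a model fitted only to the data before that point, so the comparison does not depend on one particular holdout window.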

So Seasonal ARIMA models cannot usually handle multiple seasonalities? Is it a property of the model itself or is it just the way the functions in R are written?

Standard ARIMA models handle seasonality by seasonal differencing. For seasonal monthly data, you would not model the raw time series, but the time series of differences between March 2015 and March 2014, between February 2015 and February 2014 and so forth. (To get forecasts on the original scale, you'd of course need to undifference again.)
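To make the differencing and undifferencing concrete, here is a small base R sketch on the built-in monthly AirPassengers series (an illustrative choice):

```r
y <- AirPassengers                    # built-in monthly series (illustrative)

# Seasonal difference: each month minus the same month one year earlier
d12 <- diff(y, lag = 12)

# Undifference: rebuild the series from the first year plus the differences
restored <- diffinv(d12, lag = 12, xi = y[1:12])

print(head(d12))
print(all.equal(as.numeric(restored), as.numeric(y)))
```

The differenced series is 12 observations shorter than the original, and the first year of data is what anchors the reconstruction.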

There is no immediately obvious way to extend this idea to multiple seasonalities.

Of course, you can do something using ARIMAX, e.g., by including monthly dummies to model the yearly seasonality, then model residuals using weekly seasonal ARIMA. If you want to do this in R, use ts(x,frequency=7), create a matrix of monthly dummies and feed that into the xreg parameter of auto.arima().
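A rough sketch of that dummy-variable approach, on simulated daily data (the series, dates and effect sizes are all made up for illustration):

```r
library(forecast)

set.seed(42)
dates <- seq(as.Date("2013-01-01"), by = "day", length.out = 730)
lt <- as.POSIXlt(dates)

# Simulated daily series with a yearly wave and a weekend effect
x <- 50 + 10 * sin(2 * pi * lt$yday / 365.25) +
     5 * (lt$wday %in% c(0, 6)) + rnorm(length(dates))

# Monthly dummies: 11 columns, January is the baseline
month <- factor(format(dates, "%m"))
xreg <- model.matrix(~ month)[, -1]

# Weekly seasonal ARIMA on top of the monthly dummies
fit <- auto.arima(ts(x, frequency = 7), xreg = xreg)

# Forecasting needs the same dummies for the future dates
future_dates <- seq(max(dates) + 1, by = "day", length.out = 28)
future_month <- factor(format(future_dates, "%m"), levels = levels(month))
future_xreg <- model.matrix(~ future_month)[, -1]
colnames(future_xreg) <- colnames(xreg)

fc <- forecast(fit, xreg = future_xreg)
plot(fc)
```

The forecast horizon is set by the number of rows in future_xreg, so no h is passed.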

I don't recall any publication that specifically extends ARIMA to multiple seasonalities, although I'm sure somebody has done something along the lines of my previous paragraph.

Solved – time series forecasting using auto.arima and exponential smoothing

1. Seasonality is probably not very strong. Different algorithms will give different results, unless seasonality is glaringly obvious.
2. The best measure is always to compare forecast accuracy on a holdout set: hold back the last $n$ observations, fit your models to all other observations, forecast into the last $n$ time periods with both models, then compare forecast accuracy using your error measure of choice (see 5 below).
3. Yes, this is a common complaint. I don't think there is an easy way to get the in-sample fit. But you can get the residuals: auto.arima(WWWusage)$residuals. Best to look into the code of auto.arima() to see whether you need to add or subtract them from the original series to get the fit. I'd say you have to subtract ("actuals=model+residuals"), but better check.
4. I recommend a good forecasting textbook. This is a very good start. Otherwise, read through the help pages.
5. The appropriate error measure will depend on your personal loss function. Is your pain symmetric, and will it increase more strongly with larger errors? Then use MSE. Is your pain proportional to absolute errors? Then use MAE. Best to look at multiple error measures.

One tip: averaging forecasts will usually improve accuracy. Consider taking the average of your two models' forecasts per future time bucket.
auto.arima() apparently fits no drift, even if you allow it.
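The holdout comparison, the in-sample fit and the forecast average can be sketched together as below. This assumes a current version of the forecast package, where fitted() and residuals() work on the model returned by auto.arima(). The holdout length of 10 is an arbitrary choice.

```r
library(forecast)

y <- WWWusage
h <- 10                                        # holdout length (arbitrary)
n <- length(y)
train <- window(y, end = time(y)[n - h])
test  <- window(y, start = time(y)[n - h + 1])

fit_arima <- auto.arima(train)
fit_ets   <- ets(train)

# In-sample fit: fitted values plus residuals give back the training data
print(all.equal(as.numeric(fitted(fit_arima) + residuals(fit_arima)),
                as.numeric(train)))

f_arima <- forecast(fit_arima, h = h)$mean
f_ets   <- forecast(fit_ets, h = h)$mean
f_avg   <- (f_arima + f_ets) / 2               # simple forecast average

mae <- function(f) mean(abs(test - f))
mse <- function(f) mean((test - f)^2)
errors <- sapply(list(arima = f_arima, ets = f_ets, average = f_avg),
                 function(f) c(MAE = mae(f), MSE = mse(f)))
print(errors)
```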
