Solved – Putting less weight on certain data points in a series for forecasting

forecastingoutliersrtime series

I have a data set that contains outliers (big orders) i need to forecast this series taking the outliers into consideration. I already know what the top 11 big orders are so i dont need to detect them first. I have tried a few ways to deal with this 1) forecast the data 10 times each time replacing the biggest outlier with the next biggest until the last set is run with them all replaced and then compare results 2) forecast the data another 10 times removing the outliers in each until they are all removed in the last set. Both of these work but they dont consistently give accurate forecasts. I was wondering if anyone knew another way to approach this?

One way i was thinking was running a weighted ARIMA and work it so that less/minimal weight is put on those specific data points. Is this possible?

I just want to point out that removing the known outliers does not delete that point completely, only minimizes it as there are other deals that happened in that quarter

One of my data sets is the following:

data <- matrix(c("08Q1",    "08Q2", "08Q3", "08Q4", "09Q1", "09Q2", "09Q3", "09Q4", "10Q1", "10Q2", "10Q3", "10Q4", "11Q1", "11Q2", "11Q3", "11Q4", "12Q1", "12Q2", "12Q3", "12Q4", "13Q1", "13Q2", "13Q3", "13Q4", "14Q1", "14Q2", "14Q3", "14Q4",155782698,   159463653.4,    172741125.6,    204547180,  126049319.8,    138648461.5,    135678842.1,    242568446.1,    177019289.3,    200397120.6,    182516217.1,    306143365.6,    222890269.2,    239062450.2,    229124263.2,    370575382.9,    257757410.5,    256125841.6,    231879306.6,    419580274,  268211059,  276378232.1,    261739468.7,    429127062.8,    254776725.6,    329429882.8,    264012891.6,    496745973.9),ncol=2,byrow=FALSE)

the known outliers in this series are:

outliers <- matrix(c("14Q4","14Q2","12Q1","13Q1","14Q2","11Q1","11Q4","14Q2","13Q4","14Q4","13Q1",20193525.68,18319234.7,12896323.62,12718744.01,12353002.09,11936190.13,11356476.28,11351192.31,10101527.85,9723641.25,9643214.018),ncol=2,byrow=FALSE)

please do not say about seasonality as this is only one type of data set, i have many ones without seaonality and i need the code to work for both types.

Edit by javlacalle: This is a plot of the observed data and the time points defined in the first column of outliers.

original data and outliers

Best Answer

The OP insists in dealing with the points that are reported in the question as outliers without considering them as part of a possible seasonal pattern. Below I first give an idea to treat these points separately. In the second part of the answer I propose an alternative approach in the lines of the answer given by @Irishstat, which is a more appropriate analysis of the data.

The effect of these observations can be weighted by means of regression on seasonal dummies (variables that take the value 1 at the time points related to the outliers and 0 otherwise). Then, an ARIMA model for the residuals of the regression could be fitted and used to obtain forecasts.

It may be more efficient to estimate jointly the coefficients for the dummies and those of the ARIMA model, but I did not get a satisfactory result so I decided to split it in two steps as show below.

require(forecast)
x <- ts(as.numeric(data[,2]), frequency = 4, start = c(2008, 1))
outliers <- c(2011.00, 2011.75, 2012.00, 2013.00, 2013.75, 2014.25, 2014.75)
# create dummies
dummies <- matrix(0, nrow = length(x), ncol = length(outliers)) 
for (i in seq_along(outliers))
  dummies[which(time(x) == outliers[i]),i] <- 1
# estimate the weights for these dummies and store the residuals
fitaux <- lm(x ~ dummies)
resid <- residuals(fitaux)
# fit an ARIMA model to the residuals and display forecasts
fit <- auto.arima(resid, ic = "bic")
fcast <- forecast(fit, 8)
# full code of the plot shown below is not posted to save space
plot(fcast)

forecasts of first approach

There is high uncertainty in the forecasts (wide lower and upper bounds). Although not shown, the residuals do not show autocorrelation but there is some sign of overdifferencing. The choice of the ARIMA model should be explored further, but I think this gives you the idea.


As mentioned in the comments above, I don't think the above approach is appropriate. I would do and analysis in the lines of the answer given by Irishstat. The R package tsoutliers follows the approach proposed in Chen and Liu (1993) to detect outliers in time series (e.g. additive outlies, level shifts). This is what I get:

require(tsoutliers)
fit2 <- tso(x, args.tsmethod=list(ic="bic"))
fit2
# ARIMA(0,0,0)(0,1,0)[4] with drift         
# Coefficients:
#         drift        LS4
#       8810020  -64443697
# s.e.  1289215   14293608
# sigma^2 estimated as 5.529e+14:  log likelihood=-366.02
# AIC=738.04   AICc=739.24   BIC=741.57
# Outliers:
#   type ind    time   coefhat  tstat
# 1   LS   4 2008:04 -64443697 -4.509
#
# type plot(fit2) to see the shape of the detected outlier(s)
#
# refit the model with the series adjusted for outliers
# (this will save arrangements to display forecasts
# the same model as in fit2$fit is chosen
fit2 <- auto.arima(fit2$yadj, ic="bic")
plot(forecast(fit2, 8))

forecasts based on second approach

The series is relatively clean from outliers. None of the outliers initially proposed in the question were detected. Similarly to the results shown by Irishstat, the forecasts look now more reliable, since they reflect the overall dynamics of the data.

Related Question