Solved – Outlier detection in seasonal time series via forecasting with ARIMA model

arimaforecastingoutliersseasonalitytime series

I am trying to find Outliers in a contextual time series data using ARIMA model.

My data contains the Hourly Average Speed and Volume of vehicle traffic for two months. So, day 1 (say Monday) would have 24 observations and so on.
There is seasonality in my data : Volume is really high at 08:00 and again high at 17:00 on the weekdays.

So, how I planned to do it was, take a Monday that has no outliers, use that as an input the the ARIMA model to predict the next 24 hours. And compare this prediction with the other Mondays that I have to find outliers in those Mondays that I compare. Similarly take a Tuesday that has no outliers, input that to ARIMA to predict the next 24 hours and compare that prediction to the Tuesdays I have. Do the same for Wednesdays to Sundays. Also, take holidays (since they have different vehicle count to that of working days) and compare the forecast of a holiday with the other holidays I have. [I am not sure if it makes sense but that is what I have in mind].

Will my idea work or are there some other suggestions?

And, how would I use ARIMA for hourly data?

Data in CSV format [Month – 1 and 2]

There are about 2 x 24 x 30 values in total 

Best Answer

I took your 61 consecutive days of data (24 hourly readings per day):

enter image description here

The 1464 values were not analyzed in one model because there were essentially 24 sets of 61 historical values piggy-backed together. ACF/PACF analysis makes little sense as for example. the 25th value has little or nothing to do with the 24th value etc. thus temporal autocorrelation is non-informative. The fact that the 24th value might be related to the 48th value is much more interesting or even the 192nd value (1 week later).

I analyzed it using AUTOBOX a piece of software that I have helped develop using a 30-day forecast horizon. The documentation for the approach can be found in the User Guide available from the AFS website. I will try and give you a general overview here and tie it into your intentions to want to use daily totals (and forecasts of same) to guide hourly forecasts. Mixed seasonal problems like yours can easily be mis-modelled thus over many years of forecasting 15-minute intervals, hourly intervals etc data we have formed an integrated solution. As an overview, we use daily aggregates as a predictor to hourly values thus in this case some 25 models were developed. Ultimately the 24 hourly forecasts are reconciled with the daily forecasts yielding a final solution to this "thorny problem".

The data is analyzed in a parent-to-child approach where a model is initially developed for the daily totals and here incorporating memory, daily effects and anomalies, level shifts, local time trends etc... This "parent model" leads to forecasts and confidence limits based upon possible daily totals. Note that day 3 is a Saturday and day 4 is a Sunday showing significantly lower values for Volume ( shown as X1 and X2 in the equation):

enter image description here

enter image description here

The next step is to identify 24 causal models using the parent i.e. the daily total and its forecasts as a possible predictor using memory as needed, level shifts as needed while identifying and remedying possible anomalies/level shifts/local time trends. As an example of this let me show you the graphical output for hour 12 enter image description here enter image description here . Now we have 24 sets of forecasts for the children for the next 30 days and a set of forecasts for the parent for the next 308 days. We reconcile these two to obtain the final forecasts for 24 hours for the 30-day forecast.

The reconciliation can be done in a parent-to-child or a child-to-parent manner. Clearly, 61 days of data are insufficient to capture, weekly effects, holiday effects, specific days of the month effects, long-weekend effects, week-in-month effects, monthly effects etc.. but if you had a longer series you can get the picture as to what might be possible. The detailed output can be made available to you or any interested party by contacting me as it is too voluminous to post. With this analysis, one might want to creatively program the approach with free R tools but there are a lot of pitfalls awaiting such enterprise. Hope this helps your research into what I think is a very important statistical problem regarding detecting exceptional events and accounting for their effect on the forecast horizon.

I have made similar responses in the past .. here where hour is a subgroup Robust time-series regression for outlier detection and here for product/class sub-group analyses Forecasting Amazon or Netflix demand

It is important to note that outliers reflect the impact of exceptional exogenous activity GIVEN other factors thus the importance of correctly encoding daily effects i.e. the kind of day. Longer time series would reveal holiday effects and other "causals".