Outlier Spotting – Should I Pre-Process Data in Time Series Analysis?

data mining, outliers, r, time series

My question builds on a previous post on outlier detection in generic time series, and specifically on the answer provided by the always great Rob H.

I work for a small manufacturing company that currently handles the issue, i.e. detecting outliers in sales time series, with a (dubious) automated off-the-shelf software procedure.

I think this kind of approach is questionable at best and, more often than not, I'm not happy with the results I get. I would therefore like to "double check" the output from our software using some alternative method.

Rob's idea seemed reasonable, straightforward and easy to implement, so I decided to give it a try. The question is: what if my time series are not "generic"?

STL decomposition highlights strong seasonality and a varying trend in my data:

[Figure: STL decomposition of the sales series]

(BTW, I used stl(x, s.window="periodic") as Rob suggested, but IMHO stl(x, s.window="periodic", robust=TRUE) would be a better choice, since outlier detection is the issue at hand. I'm also not sure about the s.window="periodic" part: I experimented with different values, but I don't know how to interpret the results. Maybe someone can point me in the right direction?)

Back to my question: since mine is sales data, the seasonal pattern is (or I think it is) strongly affected by calendar effects. I also have reason to believe the big level shift in 2009 is due to the financial crisis and has nothing to do with the trend.

What do I do here? Should I let the model handle this, or should I pre-process the data? Do I perform a working-day adjustment and re-align (is there such a thing?) the pre-2009 and post-2009 data, or do I let the STL decomposition do the work?

I could write another 1000 lines, but I think this should be enough to get the message across. I apologize for the wall of text and for my bad English. I also hope I did not break too many forum rules…

I hope someone out there can help!

Best Answer

  • The smooth trend should cope with economic effects without any trouble.
  • Using robust=TRUE in stl makes sense here (and I've changed my original function to do the same).
  • Unless you have more than ten years of data, I would stick with periodic seasonality. It is unlikely to change fast enough to detect with shorter time series.
  • Pre-processing the data for working days makes sense as it removes known causes of variability.
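
The points above could be combined roughly as follows. This is only a minimal sketch in base R, not the function referred to in the answer: the data are simulated, the working-day count is a plain Monday-to-Friday count that ignores public holidays, and the 3 × IQR cut-off on the remainder is an assumed threshold chosen for illustration. All variable names (`sales`, `working_days`, `adjusted`, `outlier_idx`) are hypothetical.

```r
set.seed(42)
n <- 96  # 8 years of simulated monthly data
month_starts <- seq(as.Date("2005-01-01"), by = "month", length.out = n)

# Working days per month, counted as Monday-Friday only.
# Assumption: public holidays and company shutdowns are ignored here.
working_days <- sapply(month_starts, function(d) {
  month_end <- seq(d, by = "month", length.out = 2)[2] - 1
  days <- seq(d, month_end, by = "day")
  sum(!format(days, "%u") %in% c("6", "7"))  # %u: 1 = Monday ... 7 = Sunday
})

# Simulated sales: a per-working-day rate (trend + seasonality + noise)
# multiplied by the number of working days in each month.
tt <- seq_len(n)
daily_rate <- 100 + 0.5 * tt + 10 * sin(2 * pi * tt / 12) + rnorm(n, sd = 2)
sales <- ts(daily_rate * working_days, start = c(2005, 1), frequency = 12)
sales[50] <- sales[50] * 1.4  # inject one artificial outlier

# Step 1: working-day adjustment (sales per working day).
adjusted <- sales / working_days

# Step 2: robust STL with periodic seasonality.
fit <- stl(adjusted, s.window = "periodic", robust = TRUE)
remainder <- as.numeric(fit$time.series[, "remainder"])

# Step 3: flag remainders far outside the interquartile range.
# The factor 3 is an assumed cut-off, not a recommendation from the answer.
q <- unname(quantile(remainder, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
outlier_idx <- which(remainder < q[1] - 3 * iqr | remainder > q[2] + 3 * iqr)

print(time(adjusted)[outlier_idx])  # time stamps of flagged observations
```

With real data, `sales` would be the observed series and `working_days` would ideally come from the company's own calendar rather than a weekday count.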

I suggest you try the stl approach and look at where it gives very different results from your existing method. Then look at those cases and see which method is giving the most sensible results.
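
One simple way to organise that comparison is to split the flagged observations into those both methods agree on and those only one method flags. The sketch below uses made-up index vectors; `stl_flags` and `software_flags` are hypothetical names standing in for whatever each method actually returns.

```r
# Hypothetical positions flagged by each method (illustrative values only).
stl_flags      <- c(12, 50, 73)
software_flags <- c(50, 61, 73, 88)

both          <- intersect(stl_flags, software_flags)  # flagged by both
only_stl      <- setdiff(stl_flags, software_flags)    # flagged by stl only
only_software <- setdiff(software_flags, stl_flags)    # flagged by software only

# The disagreements are the cases worth inspecting by hand.
print(list(both = both, only_stl = only_stl, only_software = only_software))
```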

I would not go the ARIMA route as it is nowhere near as robust as stl.
