Solved – Outlier detection for generic time series

autocorrelation · outliers · time series · winsorizing

In this case, "generic" means the entire gamut of macroeconomic time series that private and government statistical offices put out.

Some background – I recently started working at a data provider – we collect data releases and repackage them in a presumably more convenient and accessible fashion for our clients, and we have tens of thousands of data series (wouldn't be surprised if we topped a million, actually). As part of our QA process, we run the following outlier detection:

$X_t - X_{t-1} = E_t$
$\sigma^2$ is estimated from the resulting sample of $E_t$, and a z-score is calculated for each observation under the assumption that $E_t \sim N(0,\sigma^2)$.
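
For concreteness, a minimal sketch of that rule (the function name and the z-threshold of 4 are illustrative choices, not what we actually run):

```python
import numpy as np

def flag_outliers_first_diff(x, z_thresh=4.0):
    """Current QA rule: z-score the first differences of the series.

    Assumes the differences E_t are i.i.d. N(0, sigma^2), i.e. the
    series is a random walk. The threshold is illustrative.
    """
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)                 # E_t = X_t - X_{t-1}
    sigma = diffs.std(ddof=1)          # sigma estimated from the sample of E_t
    z = np.abs(diffs) / sigma          # z-score of each step
    # entry i of `diffs` is the step into observation i+1
    return np.where(z > z_thresh)[0] + 1
```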

I think we can do better – the math clearly falls apart for everything that isn't a random walk.

I initially thought of fitting an ARMA(m,n) based on the peaks of the autocorrelation/autocovariance functions of the series and checking the residuals (a rough sketch follows below). I'm wary of the robustness of this, and a previous question seems to indicate that autocorrelation is not particularly robust.
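
To make the idea concrete, here is a rough sketch of what I had in mind, assuming statsmodels is available; the fixed order=(1, 0, 1) is a placeholder for whatever ACF/PACF inspection or an information criterion would suggest:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def flag_outliers_arma_residuals(x, order=(1, 0, 1), z_thresh=4.0):
    """Fit an ARMA(m, n) model and z-score its residuals.

    `order` is a placeholder; in practice it would come from
    inspecting the ACF/PACF of each series or from AIC/BIC.
    """
    fit = ARIMA(np.asarray(x, dtype=float), order=order).fit()
    resid = fit.resid
    z = np.abs(resid - resid.mean()) / resid.std(ddof=1)
    return np.where(z > z_thresh)[0]
```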

Best Answer

You are quite right that the ARIMA model you are using (first differences) may not be appropriate for detecting outliers. Outliers can be pulses, level shifts, seasonal pulses or local time trends. You might want to google "INTERVENTION DETECTION IN TIME SERIES" or "AUTOMATIC INTERVENTION DETECTION" to get some reading material on intervention detection. Note that this is not the same as intervention modelling, which often assumes the nature of the outlier rather than identifying it empirically. Following mpkitas's remarks, one would include the empirically identified outliers as dummy predictor series in order to accommodate their impact.

A lot of work has been done on identifying outliers using a null filter and then identifying the appropriate ARIMA model. Some commercial packages assume that you identify the ARIMA model first (possibly flawed by the outliers) and then identify the outliers. More general procedures examine both strategies. Your current procedure follows the "use an up-front filter first" approach, but it is also flawed by the assumption built into that filter.
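
As a rough illustration of the "dummy predictor series" idea (not any particular package's procedure), one could build pulse and level-shift regressors for dates that an empirical detection step has flagged and re-fit the model with them as exogenous variables. The dates and the Python/statsmodels choice below are assumptions for the sketch:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def intervention_dummies(n, pulse_at=None, level_shift_at=None):
    """Build dummy regressors for two common intervention types.

    A pulse is 1 at a single date; a level shift is 1 from a date
    onward. The dates would come from an empirical detection step.
    """
    cols = {}
    if pulse_at is not None:
        d = np.zeros(n)
        d[pulse_at] = 1.0
        cols["pulse"] = d
    if level_shift_at is not None:
        d = np.zeros(n)
        d[level_shift_at:] = 1.0
        cols["level_shift"] = d
    return np.column_stack(list(cols.values())), list(cols.keys())

# Hypothetical usage: dates 57 and 120 stand in for empirically flagged points.
# exog, names = intervention_dummies(len(x), pulse_at=57, level_shift_at=120)
# fit = ARIMA(x, order=(1, 0, 1), exog=exog).fit()
```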

Some more reflections: to detect an anomaly, you need a model that provides an expectation. Intervention detection answers the question "What is the probability of observing what I observed before I observed it?" An ARIMA model can then be used to identify the "unusual" time series observations. The problem is that you can't catch an outlier without a model (at least a mild one) for your data; otherwise, how would you know that a point violated that model? In fact, the process of growing understanding and finding and examining outliers must be iterative. This isn't a new thought. Bacon, writing in Novum Organum about 400 years ago, said: "Errors of Nature, Sports and Monsters correct the understanding in regard to ordinary things, and reveal general forms. For whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately understand Nature." The single model you are imposing on all your series is clearly an inadequate way to go.