Solved – Outlier detection for generic time series

autocorrelation · outliers · time series · winsorizing

In this case, "generic" means the entire gamut of macroeconomic time series that private and government statistical offices put out.

Some background – I recently started working at a data provider – we collect data releases and repackage them in a presumably more convenient and accessible fashion for our clients, and we have tens of thousands of data series (wouldn't be surprised if we topped a million, actually). As part of our QA process, we run the following outlier detection:

$X_t - X_{t-1} = E_t$
$\sigma^2$ is estimated from the resulting sample of $E_t$, and a z-score is calculated for each observation under the assumption that $E_t \sim N(0,\sigma^2)$.
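
For concreteness, a minimal sketch of that rule (the function name and the z-threshold of 4 are illustrative choices, not what we actually run):

```python
import numpy as np

def flag_outliers_first_diff(x, z_thresh=4.0):
    """Current QA rule: z-score the first differences of the series.

    Assumes the differences E_t are i.i.d. N(0, sigma^2), i.e. the
    series is a random walk. The threshold is illustrative.
    """
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)                 # E_t = X_t - X_{t-1}
    sigma = diffs.std(ddof=1)          # sigma estimated from the sample of E_t
    z = np.abs(diffs) / sigma          # z-score of each step
    # entry i of `diffs` is the step into observation i+1
    return np.where(z > z_thresh)[0] + 1
```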

I think we can do better – the math clearly falls apart for everything that isn't a random walk.

I initially thought of fitting an ARMA(m,n) based on the peaks of the autocorrelation/autocovariance functions of the series and checking the residuals (a rough sketch follows below). I'm wary of the robustness of this, and a previous question seems to indicate that autocorrelation is not particularly robust.
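
To make the idea concrete, here is a rough sketch of what I had in mind, assuming statsmodels is available; the fixed order=(1, 0, 1) is a placeholder for whatever ACF/PACF inspection or an information criterion would suggest:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def flag_outliers_arma_residuals(x, order=(1, 0, 1), z_thresh=4.0):
    """Fit an ARMA(m, n) model and z-score its residuals.

    `order` is a placeholder; in practice it would come from
    inspecting the ACF/PACF of each series or from AIC/BIC.
    """
    fit = ARIMA(np.asarray(x, dtype=float), order=order).fit()
    resid = fit.resid
    z = np.abs(resid - resid.mean()) / resid.std(ddof=1)
    return np.where(z > z_thresh)[0]
```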

Best Answer

You are quite right that the ARIMA model you are using (first differences) may not be appropriate for detecting outliers. Outliers can be pulses, level shifts, seasonal pulses or local time trends. You might want to google "INTERVENTION DETECTION IN TIME SERIES" or "AUTOMATIC INTERVENTION DETECTION" to get some reading material on intervention detection. Note that this is not the same as intervention modelling, which often assumes the nature of the outlier rather than identifying it empirically. Following mpkitas's remarks, one would include the empirically identified outliers as dummy predictor series in order to accommodate their impact.

A lot of work has been done on identifying outliers using a null filter and then identifying the appropriate ARIMA model. Some commercial packages assume that you identify the ARIMA model first (possibly flawed by the outliers) and then identify the outliers. More general procedures examine both strategies. Your current procedure follows the "use an up-front filter first" approach, but it is also flawed by the assumption built into that filter.
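
As a rough illustration of the "dummy predictor series" idea (not any particular package's procedure), one could build pulse and level-shift regressors for dates that an empirical detection step has flagged and re-fit the model with them as exogenous variables. The dates and the Python/statsmodels choice below are assumptions for the sketch:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def intervention_dummies(n, pulse_at=None, level_shift_at=None):
    """Build dummy regressors for two common intervention types.

    A pulse is 1 at a single date; a level shift is 1 from a date
    onward. The dates would come from an empirical detection step.
    """
    cols = {}
    if pulse_at is not None:
        d = np.zeros(n)
        d[pulse_at] = 1.0
        cols["pulse"] = d
    if level_shift_at is not None:
        d = np.zeros(n)
        d[level_shift_at:] = 1.0
        cols["level_shift"] = d
    return np.column_stack(list(cols.values())), list(cols.keys())

# Hypothetical usage: dates 57 and 120 stand in for empirically flagged points.
# exog, names = intervention_dummies(len(x), pulse_at=57, level_shift_at=120)
# fit = ARIMA(x, order=(1, 0, 1), exog=exog).fit()
```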

Some more reflections: to detect an anomaly, you need a model that provides an expectation. Intervention detection answers the question "What is the probability of observing what I observed before I observed it?" An ARIMA model can then be used to identify the "unusual" time series observations. The problem is that you can't catch an outlier without a model (at least a mild one) for your data; otherwise, how would you know that a point violated that model? In fact, the process of growing understanding and finding and examining outliers must be iterative. This isn't a new thought. Bacon, writing in Novum Organum about 400 years ago, said: "Errors of Nature, Sports and Monsters correct the understanding in regard to ordinary things, and reveal general forms. For whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately understand Nature." The single model you are imposing on all your series is clearly an inadequate way to go.