Solved – How to correct outliers once detected for time series data forecasting

forecasting, outliers, time series, winsorizing

I'm trying to find a way to correct outliers once I have detected them in time series data. Some methods, such as nnetar in R, produce errors for time series with large outliers. I have already managed to correct the missing values, but outliers are still damaging my forecasts…

Best Answer

There is now a facility in the forecast package for R for identifying and replacing outliers. (It also handles missing values.) As you are apparently already using the forecast package, this might be a convenient solution for you. For example:

fit <- nnetar(tsclean(x))

The tsclean() function will fit a robust trend using loess (for non-seasonal series), or robust trend and seasonal components using STL (for seasonal series). The residuals from this fit are then used to calculate the following bounds:

\begin{align} U &= q_{0.9} + 2(q_{0.9}-q_{0.1}) \\ L &= q_{0.1} - 2(q_{0.9}-q_{0.1}) \end{align}

where $q_{0.1}$ and $q_{0.9}$ are the 10th and 90th percentiles of the residuals, respectively.

Outliers are identified as points with residuals larger than $U$ or smaller than $L$.
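
As a rough illustration of the detection step for a non-seasonal series, here is a minimal base R sketch. It is not the forecast package's internal code: the simulated data, the injected outliers at positions 30 and 70, and the use of loess() with family = "symmetric" as the robust trend are all assumptions made for the example.

```r
# Sketch only: approximates the detection rule described above with base R.
set.seed(1)
n <- 100
time_idx <- seq_len(n)
x <- 0.05 * time_idx + rnorm(n, sd = 0.5)   # smooth trend plus noise
x[c(30, 70)] <- x[c(30, 70)] + c(8, -8)     # two injected outliers (assumption)

# Robust trend: family = "symmetric" downweights extreme points
trend_fit <- loess(x ~ time_idx, family = "symmetric")
res <- x - predict(trend_fit)

# Bounds built from the 10th and 90th percentiles of the residuals
q <- quantile(res, probs = c(0.1, 0.9), names = FALSE)
U <- q[2] + 2 * (q[2] - q[1])
L <- q[1] - 2 * (q[2] - q[1])

# Points whose residuals fall outside [L, U] are flagged as outliers
outlier_idx <- which(res > U | res < L)
print(outlier_idx)
```

Because the bounds are built from percentiles rather than from the standard deviation, a few extreme residuals have little influence on where the cut-offs land.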

For non-seasonal time series, outliers are replaced by linear interpolation. For seasonal time series, the seasonal component from the STL fit is removed and the seasonally adjusted series is linearly interpolated to replace the outliers, before re-seasonalizing the result.
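
The seasonal replacement step can be sketched with base R's stl() and approx(). Again, this is an approximation under stated assumptions rather than the package's actual implementation: the monthly series is simulated, the outlier at position 50 is injected by hand, and s.window = "periodic" is an arbitrary choice.

```r
# Sketch only: seasonal adjust -> interpolate -> re-seasonalize, in base R.
set.seed(2)
n <- 120
time_idx <- seq_len(n)
x <- ts(10 + 0.05 * time_idx + 3 * sin(2 * pi * time_idx / 12) +
          rnorm(n, sd = 0.3), frequency = 12)
x[50] <- x[50] + 10   # injected outlier (assumption)

# Robust STL decomposition into seasonal, trend, and remainder
fit <- stl(x, s.window = "periodic", robust = TRUE)
seasonal <- as.numeric(fit$time.series[, "seasonal"])
res <- as.numeric(fit$time.series[, "remainder"])

# Same percentile-based bounds as in the answer
q <- quantile(res, probs = c(0.1, 0.9), names = FALSE)
U <- q[2] + 2 * (q[2] - q[1])
L <- q[1] - 2 * (q[2] - q[1])
outlier_idx <- which(res > U | res < L)

# Remove the seasonal component and blank out the flagged points
adjusted <- as.numeric(x) - seasonal
adjusted[outlier_idx] <- NA

# Linear interpolation across the gaps (rule = 2 extends at the ends)
keep <- !is.na(adjusted)
filled <- approx(time_idx[keep], adjusted[keep], xout = time_idx, rule = 2)$y

# Add the seasonal component back
x_clean <- ts(filled + seasonal, start = start(x), frequency = frequency(x))
```

For a non-seasonal series the same idea applies without the seasonal adjustment: set the flagged points to NA and interpolate directly.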

Related Solutions

Solved – STL on time series with missing values for anomaly detection

ARIMA models can easily incorporate dummy variables to deal with missing values. These are called pulse indicators. The methodology is straightforward and documented in http://www.unc.edu/~jbhill/tsay.pdf. In general, the method extracts information about pulses, level shifts, seasonal pulses, and local time trends from the current residual series.
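
To make the idea of a pulse indicator concrete, here is a small sketch using stats::arima() with an external regressor. It assumes the affected time point is already known (position 50, chosen for the example) and uses a simulated AR(1) series; it does not implement the iterative detection of pulses, level shifts, or trends described in the referenced paper.

```r
# Sketch only: a pulse dummy at a known time point in an AR(1) model.
set.seed(3)
n <- 100
x <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
pulse_at <- 50                       # hypothetical outlier location
x[pulse_at] <- x[pulse_at] + 10      # injected pulse of size 10 (assumption)

# Pulse indicator: 1 at the affected observation, 0 elsewhere
pulse <- as.numeric(seq_len(n) == pulse_at)

# Fit the ARIMA model with the pulse as an external regressor
fit <- arima(x, order = c(1, 0, 0), xreg = cbind(pulse = pulse))
pulse_effect <- unname(coef(fit)["pulse"])

# Subtracting the estimated effect gives an adjusted series
x_adjusted <- x - pulse_effect * pulse
print(pulse_effect)
```

The coefficient on the dummy absorbs the anomalous observation, so the AR parameter is estimated as if that point had been replaced by its model-based value.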

Solved – How to find outliers in a data series

How are you defining "outlier"? Looking at the example plot, I don't see any real outliers. There's just some noise in the data.

However, if you wanted to identify the points that were farthest from the fitted line, that would be fairly straightforward using the predict or residuals functions on the appropriate model. E.g.

set.seed(1)                     # make the simulated example reproducible
x <- 1:100
y <- 3*x + rnorm(100)
m1 <- lm(y ~ x)
residm1 <- residuals(m1)
ranks <- rank(abs(residm1))     # higher rank = farther from the fitted line

You could then select the n points with the largest ranks for inspection, or choose a minimum absolute residual that would qualify as an "outlier".
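
Both selection strategies can be sketched as follows. The injected point at position 40, the choice of n_top = 5, and the cut-off of three residual standard deviations are assumptions for illustration, not recommended defaults.

```r
# Sketch only: pick candidate outliers by count or by threshold.
set.seed(4)
x <- 1:100
y <- 3 * x + rnorm(100)
y[40] <- y[40] + 15                 # injected for illustration (assumption)

m1 <- lm(y ~ x)
abs_res <- abs(residuals(m1))

# Option 1: the n points farthest from the fitted line
n_top <- 5
top_n <- order(abs_res, decreasing = TRUE)[seq_len(n_top)]

# Option 2: every point whose absolute residual exceeds a cut-off
threshold <- 3 * sd(residuals(m1))  # hypothetical cut-off
flagged <- which(abs_res > threshold)

print(top_n)
print(flagged)
```

The first option always returns n points, whether or not they are unusual, so it suits manual inspection; the second returns nothing when no residual is extreme.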
