You cannot reliably detect outliers by measuring the distance of each observation from a classical fit of your data, because the fitting procedure itself is liable to be pulled towards the outliers (this is called the masking effect). One simple way to reliably detect outliers is to use the general idea you suggested
(distance from fit) but to replace the classical estimators by robust ones that are much less susceptible to being swayed by outliers. Below I present a general illustration of the idea and then discuss the solution for your specific problem.
An illustration: consider the following 20 observations
drawn from a $\mathcal{N}(0,1)$ (rounded to two decimal
places):
x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)
(the last two values really ought to be 0.81 and 1.76 but have
been accidentally mistyped).
Using an outlier detection rule based on comparing the statistic
$$\frac{|x_i-\text{ave}(x_i)|}{\text{sd}(x_i)}$$
to the quantiles of a normal distribution would never
lead you to suspect that 8.1 is an outlier, leading you
to estimate the $\text{sd}$ of the 'trimmed' series to be
2 (for comparison, the raw, i.e. untrimmed, estimate of
the $\text{sd}$ is 4.35).
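In R, with the vector x defined above, the classical scores look like this (the 0.975 normal quantile is just one conventional choice of cut-off):
# classical z-scores: the mean and sd are dragged towards the outliers,
# so 8.1 scores only about 1.6 and is never flagged
z_classical <- abs(x - mean(x)) / sd(x)
round(z_classical, 2)
which(z_classical > qnorm(0.975))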
Had you used a robust statistic instead:
$$\frac{|x_i-\text{med}(x_i)|}{\text{mad}(x_i)}$$
and compared the resulting robust $z$-scores to the
quantiles of a normal, you would have correctly flagged
the last two observations as outliers (and correctly
estimated the $\text{sd}$ of the trimmed series to be
0.96).
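The robust version of the same computation:
# robust z-scores: median() and mad() are barely affected by the two
# mistyped values, so 8.1 and 17.6 now score roughly 12 and 26
z_robust <- abs(x - median(x)) / mad(x)
round(z_robust, 2)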
(In the interest of completeness I should point out that
some people, even in this day and age, prefer to cling to the raw --untrimmed--
estimate of 4.35 rather than using the more precise estimate based on trimming, but this preference is unintelligible to me.)
For other distributions the situation is not that different,
except that you will first have to transform your data.
For example, in your case:
Suppose $X$ is your original count data.
One trick is to use the transformation:
$$Y=2\sqrt{X}$$
and to exclude an observation as an outlier
if $Y>\text{med}(Y)+3$ (this rule is not symmetric,
and I for one would be very cautious
about excluding observations from the left 'tail' of
a count variable according to a data-based threshold;
negative observations, obviously, should
be pretty safe to remove).
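A minimal sketch of this rule in R (the simulated Poisson counts and the planted outlying value 60 are purely illustrative; substitute your own data):
# illustrative count data: 100 'clean' Poisson counts plus one bad value
set.seed(42)
counts <- c(rpois(100, lambda = 5), 60)
# variance-stabilising transform
y <- 2 * sqrt(counts)
# flag observations whose transformed value lies more than 3 above med(Y)
flagged <- which(y > median(y) + 3)
counts[flagged]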
This is based on the idea that if $X$ is Poisson,
then
$$Y\approx \mathcal{N}(\text{med}(Y),1)$$
This approximation works reasonably well for Poisson-distributed
data when $\lambda$ (the parameter of the
Poisson distribution) is larger than 3.
When $\lambda$ is smaller than 3 (or when the model
governing the distribution of the majority of the data
has a mode closer to 0 than a Poisson with $\lambda=3$, as
with e.g. ZINB r.v.'s) the approximation tends to err on
the conservative side (it rejects fewer data points as outliers).
To see why this is considered 'conservative', consider
that at the limit (when the data is binomial with very
small $p$) no observation would ever be flagged as an outlier
by this rule, and this is precisely the behaviour we want:
to cause masking, outliers have to be able to drive the
estimated parameters arbitrarily far away from their true
values. When the data is drawn from a distribution with
bounded support (such as the binomial), this simply
cannot happen.
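If you want to check how this rule is calibrated for a given $\lambda$, you can compare the exact Poisson probability of being flagged with the nominal normal tail probability; the helper function and the values of $\lambda$ below are only illustrative, and the population median of the Poisson stands in for $\text{med}(Y)$:
# exact probability that a 'clean' Poisson observation exceeds med(Y) + 3
# on the Y = 2*sqrt(X) scale
flag_prob <- function(lambda) {
  m <- qpois(0.5, lambda)              # (population) median of X
  cutoff <- (sqrt(m) + 1.5)^2          # X-scale equivalent of med(Y) + 3
  ppois(floor(cutoff), lambda, lower.tail = FALSE)
}
sapply(c(1, 2, 3, 5, 10, 50), flag_prob)
pnorm(-3)   # nominal normal tail, about 0.00135; small lambdas stay well below it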
Best Answer
How are you defining "outlier"? Looking at the example plot, I don't see any real outliers. There's just some noise in the data.
However, if you wanted to identify the points that were farthest from the fitted line, that would be fairly straightforward using the predict or residuals functions on the appropriate model. E.g. you could then select the largest n values for inspection, or choose a minimum residual that would qualify as an "outlier".
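For instance (a minimal sketch with made-up data and a plain lm fit; substitute your own data and model):
# hypothetical data and model; replace with your actual fit
set.seed(1)
dat <- data.frame(x = 1:50)
dat$y <- 2 * dat$x + rnorm(50, sd = 5)
fit <- lm(y ~ x, data = dat)
# residuals() measures how far each point lies from the fitted line
res <- residuals(fit)
# option 1: inspect the n points farthest from the fit (n = 5 is arbitrary)
head(order(abs(res), decreasing = TRUE), 5)
# option 2: treat any residual beyond a chosen threshold as an "outlier"
which(abs(res) > 2 * sd(res))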