How to Define the Multiplier Range for Variance Test-Based Outlier Detection?

Tags: mean, outliers, standard-deviation, variance-test

I have a variance-test-based outlier detection algorithm. The algorithm is exposed through a visual application where the user can configure its parameters, namely the multiplier.

The question is: what is the range of values to allow for the algorithm's multiplier?

The algorithm calculates the mean and standard deviation (sigma) of the data set, then compares each element of the data set to the upper and lower bounds to flag it as an outlier or not.

multiplier <- 2.3; 
upper_bound <- mean + multiplier * sigma; 
lower_bound <- mean - multiplier * sigma; 
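As a sketch of the full flagging step (the data vector and its values here are illustrative placeholders, not fixed by the question), the rule above can be applied in R as:

```r
# Toy data: ten values near 1 plus one gross outlier (illustrative only)
x <- c(1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.05, 0.95, 1.0, 9.7)
multiplier <- 2.3

m <- mean(x)
s <- sd(x)
upper_bound <- m + multiplier * s
lower_bound <- m - multiplier * s

# TRUE for every element outside [lower_bound, upper_bound]
is_outlier <- x > upper_bound | x < lower_bound
x[is_outlier]  # the flagged observations
```

Note that with very small samples a single outlier can never exceed the bound, because the classical z-score is capped at $(n-1)/\sqrt{n}$; this is one face of the masking effect discussed in the answer below.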

The evolution of the two bounds, upper_bound (red) and lower_bound (blue), as the multiplier varies is as follows:

[Figure: upper_bound (red) and lower_bound (blue) plotted against the multiplier]

But this graph doesn't give any clue about what range to choose. Do you have any idea how to define this range?

Best Answer

To find the outliers, you cannot use the distance of an observation to a model through a rule such as:

$$\frac{|\hat{\mu}-x_i|}{\hat{\sigma}},\;i=1,\ldots,n$$

if your estimates $(\hat{\mu},\hat{\sigma})$ are the classical ones (the usual mean and standard deviation), because the fitting procedure used to obtain them is itself liable to be pulled towards the outliers (this is called the masking effect).

One simple way to reliably detect outliers, however, is to use the general idea you suggested (distance from a fit) but to replace the classical estimators with robust ones that are much less susceptible to being swayed by outliers. Below I present a general illustration of the idea. If you give more information about your specific problem, I can amend my answer to address the particulars of your situation.

An illustration: consider the following 20 observations drawn from a $\mathcal{N}(0,1)$ (rounded to the second digit):

x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)

(the last two really ought to be 0.81 and 1.76 but have been accidentally mistyped).

Using an outlier detection rule based on comparing the statistic

$$\frac{|x_i-\text{ave}(x_i)|}{\text{sd}(x_i)}$$

to the quantiles of a normal distribution would never lead you to suspect that 8.1 is an outlier, and would lead you to estimate the $\text{sd}$ of the 'trimmed' series to be 2 (for comparison, the raw, i.e. untrimmed, estimate of the $\text{sd}$ is 4.35).
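You can check this masking on the data above directly (the cutoff of 3 is an illustrative choice, not prescribed by any rule):

```r
x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)

# Classical z-scores: mean and sd are themselves inflated by 8.1 and 17.6
z <- abs(x - mean(x)) / sd(x)
z[19]  # the score of 8.1 stays well below a cutoff of 3: it is masked
```

The extreme value 17.6 drags the sample sd up to about 4.35, so 8.1 sits less than two (classical) standard deviations from the (inflated) mean.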

Had you used a robust statistic instead:

$$\frac{|x_i-\text{med}(x_i)|}{\text{mad}(x_i)}$$

and compared the resulting robust $z$-scores to the chosen quantiles of a candidate distribution (typically the standard normal, if you can assume the $x_i$'s to be symmetrically distributed), you would have correctly flagged the last two observations as outliers (and correctly estimated the $\text{sd}$ of the trimmed series to be 0.96).
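The robust version uses R's built-in median and mad (the latter is scaled by 1.4826 by default, making the MAD a consistent estimator of the normal sd). A sketch on the same data:

```r
x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)

# Robust z-scores: median and MAD are essentially unaffected by the outliers
rz <- abs(x - median(x)) / mad(x)
rz[19:20]  # both mistyped values now stand out by an order of magnitude
```

Here the two mistyped observations score above 11 and 25 respectively, far beyond any reasonable normal quantile, while the genuine observations stay in a much lower range.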