How to Define the Multiplier Range for Variance Test-Based Outlier Detection?

Tags: mean, outliers, standard-deviation, variance-test

I have a variance-test-based outlier detection algorithm. The algorithm is exposed through a visual application where the user can configure its parameters, namely the multiplier.

The question is: what is the range of values to allow for the algorithm's multiplier?

The algorithm calculates the mean and standard deviation (sigma) of the data set, then compares each element of the data set to the upper and lower bounds to flag it as an outlier or not.

multiplier <- 2.3; 
upper_bound <- mean + multiplier * sigma; 
lower_bound <- mean - multiplier * sigma; 
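As a sketch of the full flagging step (the data vector and its values here are illustrative placeholders, not fixed by the question), the rule above can be applied in R as:

```r
# Toy data: ten values near 1 plus one gross outlier (illustrative only)
x <- c(1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.05, 0.95, 1.0, 9.7)
multiplier <- 2.3

m <- mean(x)
s <- sd(x)
upper_bound <- m + multiplier * s
lower_bound <- m - multiplier * s

# TRUE for every element outside [lower_bound, upper_bound]
is_outlier <- x > upper_bound | x < lower_bound
x[is_outlier]  # the flagged observations
```

Note that with very small samples a single outlier can never exceed the bound, because the classical z-score is capped at $(n-1)/\sqrt{n}$; this is one face of the masking effect discussed in the answer below.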

The evolution of the two bounds, upper_bound (red) and lower_bound (blue), as the multiplier varies is as follows:

[Figure: upper_bound (red) and lower_bound (blue) plotted against the multiplier]

But this graph doesn't give any clue about what range to choose. Do you have any idea how to define this range?

Best Answer

To find the outliers, you cannot use the distance of an observation to a model through a rule such as:

$$\frac{|\hat{\mu}-x_i|}{\hat{\sigma}},\;i=1,\ldots,n$$

if your estimates $(\hat{\mu},\hat{\sigma})$ are the classical ones (the usual mean and standard deviation), because the fitting procedure used to obtain them is itself liable to be pulled towards the outliers (this is called the masking effect).

One simple way to reliably detect outliers, however, is to use the general idea you suggested (distance from a fit) but to replace the classical estimators with robust ones that are much less susceptible to being swayed by outliers. Below I present a general illustration of the idea. If you give more information about your specific problem, I can amend my answer to address the particulars of your situation.

An illustration: consider the following 20 observations drawn from a $\mathcal{N}(0,1)$ (rounded to the second digit):

x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)

(the last two really ought to be 0.81 and 1.76 but have been accidentally mistyped).

Using an outlier detection rule based on comparing the statistic

$$\frac{|x_i-\text{ave}(x_i)|}{\text{sd}(x_i)}$$

to the quantiles of a normal distribution would never lead you to suspect that 8.1 is an outlier, and would lead you to estimate the $\text{sd}$ of the 'trimmed' series to be 2 (for comparison, the raw, i.e. untrimmed, estimate of the $\text{sd}$ is 4.35).
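You can check this masking on the data above directly (the cutoff of 3 is an illustrative choice, not prescribed by any rule):

```r
x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)

# Classical z-scores: mean and sd are themselves inflated by 8.1 and 17.6
z <- abs(x - mean(x)) / sd(x)
z[19]  # the score of 8.1 stays well below a cutoff of 3: it is masked
```

The extreme value 17.6 drags the sample sd up to about 4.35, so 8.1 sits less than two (classical) standard deviations from the (inflated) mean.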

Had you used a robust statistic instead:

$$\frac{|x_i-\text{med}(x_i)|}{\text{mad}(x_i)}$$

and compared the resulting robust $z$-scores to the chosen quantiles of a candidate distribution (typically the standard normal, if you can assume the $x_i$'s to be symmetrically distributed), you would have correctly flagged the last two observations as outliers (and correctly estimated the $\text{sd}$ of the trimmed series to be 0.96).
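The robust version uses R's built-in median and mad (the latter is scaled by 1.4826 by default, making the MAD a consistent estimator of the normal sd). A sketch on the same data:

```r
x <- c(-2.21, -1.84, -0.95, -0.91, -0.36, -0.19, -0.11, -0.10, 0.18,
       0.30, 0.31, 0.43, 0.51, 0.64, 0.67, 0.72, 1.22, 1.35, 8.10, 17.60)

# Robust z-scores: median and MAD are essentially unaffected by the outliers
rz <- abs(x - median(x)) / mad(x)
rz[19:20]  # both mistyped values now stand out by an order of magnitude
```

Here the two mistyped observations score above 11 and 25 respectively, far beyond any reasonable normal quantile, while the genuine observations stay in a much lower range.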