Detecting Outliers Using Standard Deviations – Methods and Techniques

outliers

Following my question here, I am wondering if there are strong views for or against the use of standard deviations to detect outliers (e.g. treating any data point that is more than 2 standard deviations from the mean as an outlier).

I know this depends on the context of the study; for instance, a data point of 48 kg will certainly be an outlier in a study of babies' weights but not in a study of adults' weights.

Outliers can arise from a number of causes, such as data entry mistakes. In my case, the data entry processes are robust.

I guess the question I am asking is: Is using standard deviation a sound method for detecting outliers?
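For concreteness, the rule being asked about can be sketched as follows. This is only an illustration: the function name `flag_by_sd`, the cutoff parameter `k`, and the sample weights are hypothetical and do not come from any real study.

```python
import statistics

def flag_by_sd(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return [x for x in values if abs(x - mean) > k * sd]

# Hypothetical baby weights in kg, with one impossible entry.
weights = [3.1, 3.4, 2.9, 3.6, 3.2, 3.8, 3.0, 3.3, 3.5, 48.0]
print(flag_by_sd(weights))
```

Note that the extreme value itself inflates both the mean and the standard deviation used to judge it, which is one reason the rule can behave poorly on small samples.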

Best Answer

Some outliers are clearly impossible. You mention 48 kg for baby weight. This is clearly an error. That's not a statistical issue, it's a substantive one. There are no 48 kg human babies. Any statistical method will identify such a point.

Personally, rather than rely on any test (even appropriate ones, as recommended by @Michael), I would graph the data. Showing that a certain data value (or set of values) is unlikely under some hypothesized distribution does not mean the value is wrong, so values shouldn't be automatically deleted just because they are extreme.

In addition, the rule you propose (2 SD from the mean) is an old one that was used in the days before computers made things easy. If N is 100,000, then you should expect several thousand values (about 4.5% of them) to lie more than 2 SD from the mean, even if the data follow a perfect normal distribution.
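To put numbers on this, here is a minimal sketch that computes the two-sided normal tail probability P(|Z| > k) = erfc(k / sqrt(2)) and the corresponding expected count for N = 100,000. The helper name `two_sided_tail_prob` and the extra cutoffs of 3 and 4 SD are my own additions for comparison.

```python
import math

def two_sided_tail_prob(k):
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))

n = 100_000
for k in (2, 3, 4):
    p = two_sided_tail_prob(k)
    # Expected number of perfectly legitimate points beyond k SD from the mean.
    print(f"k={k}: P(|Z| > k) = {p:.5f}, expected count in n={n}: {n * p:.0f}")
```

So a fixed 2 SD cutoff flags thousands of perfectly ordinary observations in a sample of this size; any cutoff has to be read in light of N.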

But what if the assumed distribution is wrong? Suppose that, in the population, the variable in question is not normally distributed but has heavier tails than the normal?
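As a rough illustration of the heavy-tail concern, the sketch below compares the normal tail with that of a Laplace (double-exponential) distribution. The Laplace is my choice of example, not something the question specifies. A Laplace with scale b has SD b·sqrt(2), so P(|X − μ| > k·SD) = exp(−k·sqrt(2)).

```python
import math

def normal_tail(k):
    """P(|X - mu| > k * SD) when X is normal."""
    return math.erfc(k / math.sqrt(2))

def laplace_tail(k):
    """P(|X - mu| > k * SD) when X is Laplace (SD = scale * sqrt(2))."""
    return math.exp(-k * math.sqrt(2))

for k in (2, 3, 4):
    ratio = laplace_tail(k) / normal_tail(k)
    print(f"k={k}: normal={normal_tail(k):.5f}, "
          f"laplace={laplace_tail(k):.5f}, ratio={ratio:.1f}")
```

Under this assumed alternative, the gap grows quickly as the cutoff moves outward, so points that look wildly improbable under a normal model can be fairly routine. Other heavy-tailed distributions will give different numbers, particularly near 2 SD.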
