The basis for the Box and Whisker Plot definition of an outlier

Tags: boxplot, normality-assumption, outliers, qq-plot

The standard definition of an outlier for a Box and Whisker plot is any point outside the interval $\left[Q1-1.5\,IQR,\;Q3+1.5\,IQR\right]$, where $IQR = Q3-Q1$, $Q1$ is the first quartile, and $Q3$ is the third quartile of the data.
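As a minimal sketch of that rule, the fences can be computed directly. The data here are an illustrative assumption (200 simulated normal draws), not anything from the question. Note that `quantile()` with R's default settings can give slightly different quartiles from the hinges that `boxplot()` uses, so the flagged points may differ marginally from those drawn on a plot.

```r
# Illustrative data (an assumption for this sketch): 200 standard normal draws
set.seed(1)
x <- rnorm(200)

# Q1 and Q3 using R's default quantile definition (type 7)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Fences of the 1.5 * IQR rule
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Points the rule would flag
flagged <- x[x < lower | x > upper]
```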

What is the basis for this definition? With a large number of points, even a perfectly normal distribution returns outliers.

For example, suppose you start with the sequence:

xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)

This sequence gives 4000 evenly spaced percentile ranks, from 0.999875 down to 0.000125.

Applying qnorm to this sequence gives the corresponding normal quantiles. Testing them for normality results in:

shapiro.test(qnorm(xseq))

    Shapiro-Wilk normality test

data:  qnorm(xseq)
W = 0.99999, p-value = 1

library(nortest)  # ad.test comes from the nortest package
ad.test(qnorm(xseq))

    Anderson-Darling normality test

data:  qnorm(xseq)
A = 0.00044273, p-value = 1

The results are exactly as expected: neither test finds any departure from normality in data constructed from normal quantiles. Plotting qqnorm(qnorm(xseq)) produces, as expected, a straight line:

qqnorm plot of data

A boxplot of the same data, boxplot(qnorm(xseq)), produces this result:

boxplot of the data

The boxplot, unlike shapiro.test, ad.test, or qqnorm, identifies several points as outliers when the sample size is sufficiently large (as in this example).
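To put a number on "several points", the flagged values can be counted without drawing the plot. This is a sketch using `boxplot.stats()`, which computes the statistics that `boxplot()` draws (with the default whisker coefficient of 1.5). The variable names are only for illustration.

```r
# Same idealised normal sample as above
xseq <- seq(1 - 0.5/4000, 0.5/4000, by = -0.00025)
x <- qnorm(xseq)

# boxplot.stats() returns the hinges, whisker ends and flagged points
bs <- boxplot.stats(x)

n_out <- length(bs$out)        # number of points beyond the whiskers
frac_out <- n_out / length(x)  # share of the sample that is flagged
```

For this sample the flagged share should come out a little under 1%, even though the data are normal by construction.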

Best Answer

Boxplots

Here is a relevant section from Hoaglin, Mosteller and Tukey (2000): Understanding Robust and Exploratory Data Analysis. Wiley. Chapter 3, "Boxplots and Batch Comparison", written by John D. Emerson and Judith Strenio (from page 62):

[...] Our definition of outliers as data values that are smaller than $F_{L}-\frac{3}{2}d_{F}$ or larger than $F_{U}+\frac{3}{2}d_{F}$ is somewhat arbitrary, but experience with many data sets indicates that this definition serves well in identifying values that may require special attention.[...]

$F_{L}$ and $F_{U}$ denote the first and third quartile, whereas $d_{F}$ is the interquartile range (i.e. $F_{U}-F_{L}$).

They go on and show the application to a Gaussian population (page 63):

Consider the standard Gaussian distribution, with mean $0$ and variance $1$. We look for population values of this distribution that are analogous to the sample values used in the boxplot. For a symmetric distribution, the median equals the mean, so the population median of the standard Gaussian distribution is $0$. The population fourths are $-0.6745$ and $0.6745$, so the population fourth-spread is $1.349$, or about $\frac{4}{3}$. Thus $\frac{3}{2}$ times the fourth-spread is $2.0235$ (about $2$). The population outlier cutoffs are $\pm 2.698$ (about $2\frac{2}{3}$), and they contain $99.3\%$ of the distribution. [...]

So

[they] show that if the cutoffs are applied to a Gaussian distribution, then $0.7\%$ of the population is outside the outlier cutoffs; this figure provides a standard of comparison for judging the placement of the outlier cutoffs [...].
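The arithmetic in the quoted passage can be reproduced with `qnorm()` and `pnorm()`. This is only a check of the quoted figures for the standard Gaussian. The variable names are chosen here and do not come from the book.

```r
# Upper population fourth (third quartile) of the standard Gaussian
fourth_upper <- qnorm(0.75)                    # about 0.6745

# Fourth-spread (interquartile range); the lower fourth is -fourth_upper
fourth_spread <- 2 * fourth_upper              # about 1.349

# Upper outlier cutoff; the lower cutoff is its negative by symmetry
cutoff <- fourth_upper + 1.5 * fourth_spread   # about 2.698

# Probability mass outside both cutoffs
outside <- 2 * pnorm(-cutoff)                  # about 0.007, i.e. 0.7%
```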

Further, they write

[...] Thus we can judge whether our data seem heavier-tailed than Gaussian by how many points fall beyond the outlier cutoffs. [...]
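To illustrate the comparison, here is a sketch that computes the population fraction beyond the cutoffs for any distribution with a quantile and distribution function in R. The helper `frac_outside()` is hypothetical (it is not from the book or from any package), and the choice of a t distribution with 3 degrees of freedom and a Cauchy distribution as heavier-tailed examples is an assumption made here.

```r
# Hypothetical helper: population share outside the 1.5 * IQR cutoffs.
# qfun / pfun are a matching quantile and distribution function, e.g. qnorm / pnorm.
frac_outside <- function(qfun, pfun, ...) {
  q1 <- qfun(0.25, ...)
  q3 <- qfun(0.75, ...)
  spread <- q3 - q1
  lower <- q1 - 1.5 * spread
  upper <- q3 + 1.5 * spread
  # Mass below the lower cutoff plus mass above the upper cutoff
  pfun(lower, ...) + pfun(upper, ..., lower.tail = FALSE)
}

p_norm   <- frac_outside(qnorm, pnorm)        # Gaussian baseline
p_t3     <- frac_outside(qt, pt, df = 3)      # heavier tails than Gaussian
p_cauchy <- frac_outside(qcauchy, pcauchy)    # very heavy tails
```

The heavier the tails, the larger the share beyond the cutoffs, which is the standard of comparison the authors describe.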

They provide a table with the expected proportion of values that fall outside the outlier cutoffs (labelled "Total % Out"):

Table 3-2

So these cutoffs were never intended to be a strict rule about which data points are outliers. As you noted, even a perfect Normal distribution is expected to exhibit "outliers" in a boxplot.


Outliers

As far as I know, there is no universally accepted definition of an outlier. I like the definition by Hawkins (1980):

An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.

Ideally, you should only treat data points as outliers once you understand why they don't belong to the rest of the data. A simple rule is not sufficient. A good treatment of outliers can be found in Aggarwal (2013).

References

Aggarwal CC (2013): Outlier Analysis. Springer.
Hawkins D (1980): Identification of Outliers. Chapman and Hall.
Hoaglin DC, Mosteller F, Tukey JW (eds.) (2000): Understanding Robust and Exploratory Data Analysis. Wiley.