How to interpret whiskers of a box plot when there are outliers

boxplot, outliers

When I look at the definition of box plots, the whiskers are said to indicate the extreme values. So, if I look at height in a sample of people, the lower whisker should denote the height of the shortest person in the sample and the upper whisker the height of the tallest. However, when you include fliers (outliers), they indicate values that lie below and above the whiskers. That means that the lower whisker may NOT indicate the height of the shortest person, and thinking of the whiskers as the extreme values no longer holds.

I am confused.

Do the whiskers indicate the boundaries of a 95 percent confidence interval so that the lower whisker only indicates the lowest height within that interval but not within the sample? The lowest flier value then indicates the lowest value in the sample?

Best Answer

The question admits confusion and is itself contradictory. If whiskers extend to the extremes, then how can there be plotted data points beyond them?

Let's back up. (Incidentally, the term flier is, I think, not in general statistical use, and I won't try to guess what it might mean, except presumably as a (loose) synonym for outlier.)

There is not a universal definition of box plots. Nor do published box plots always, or even often, come with an explanation of the precise rules used in their construction.

So much for the bad news.

In general, the box of a box plot shows the median and quartiles. A whisker is drawn outwards from each quartile and data points beyond the whiskers are shown individually. Occasionally, box plots are decorated with other summaries or details, such as point symbols showing means or labelling for selected outliers (this big river is "Amazon", or whatever).

Whiskers could extend

  1. All the way to the maximum or minimum, in which case no data values are plotted as points beyond.

  2. As defined by a convention used by Tukey, J.W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley. The interquartile range IQR is first calculated as IQR = upper quartile $-$ lower quartile. Tukey himself at various points used terms such as hinges and fourths for (approximate) quartiles, but these terms are slowly fading away; most recent treatments have reverted to the earlier term quartiles. There are several conventions for exactly how quartiles are calculated, which I will treat as small print here.

    • One whisker is drawn from the lower quartile down to the lowest data point $\ge$ the lower quartile $-$ 1.5 IQR.

    • The other whisker is drawn from the upper quartile to the highest data point $\le$ the upper quartile $+$ 1.5 IQR.

    • Points beyond either whisker are plotted separately. These are often dubbed outliers, but keep reading for some opinionated discussion.

    • Note that in principle the whisker could be of zero length and not visible as such, for example when the minimum and the lower quartile are identical, as is quite common with certain kinds of data. Example: counted data including zeros in which at least 25% of the values are zero.

    • Some programs allow you to specify multipliers other than 1.5. Tukey himself before 1977 used multipliers of 1 and/or 2.

  3. Other conventions exist, such as drawing whiskers to particular paired percentiles such as those for 1 and 99%, 5 and 95% and 10 and 90%. Usually, data points beyond those percentiles are shown individually as with #2. My personal view is that these conventions are easier to explain than #2.
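
The rules under #2 can be sketched in a few lines of Python. This is only an illustration, not how any particular program draws its box plots: the function name `tukey_whiskers` and the sample heights are made up, and the quartiles come from `statistics.quantiles` with `method="inclusive"`, which is just one of the several quartile conventions mentioned above.

```python
import statistics


def tukey_whiskers(values, k=1.5):
    """Return (lower_whisker, upper_whisker, points_beyond) for multiplier k.

    Hypothetical helper for illustration only.
    """
    data = sorted(values)
    # Three cut points: lower quartile, median, upper quartile.
    # "inclusive" is an assumed choice among several quartile conventions.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Whiskers end at actual data points, not at the limits themselves.
    lower_whisker = min(x for x in data if x >= lower_limit)
    upper_whisker = max(x for x in data if x <= upper_limit)
    # Anything beyond a whisker end would be plotted individually.
    beyond = [x for x in data if x < lower_whisker or x > upper_whisker]
    return lower_whisker, upper_whisker, beyond


# Made-up heights in cm: one very tall person.
heights = [150, 160, 162, 165, 168, 170, 172, 175, 178, 180, 210]
heights_result = tukey_whiskers(heights)

# Made-up counts with many zeros: minimum equals lower quartile.
counts = [0, 0, 0, 0, 1, 2, 3, 5]
counts_result = tukey_whiskers(counts)
```

With these heights the quartiles are 163.5 and 176.5, so IQR = 13 and the limits are 144 and 196. The lower whisker therefore ends at 150, which here is the sample minimum, while the upper whisker ends at 180 and 210 is shown as a separate point. With the counts, the lower quartile and the minimum are both 0, so the lower whisker has zero length.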

I would assert that box plots do not show confidence intervals at all, in so far as any display of confidence intervals is usually called something else, or just explained as such.

On outliers, here is one opinion. Others may be found.

Tukey's intention was essentially that the researcher would think about how to handle data points plotted individually, and (for example) that a straggly box plot with outliers might point to analysis on a transformed scale.

It is quite common to see discussions in which lying more than 1.5 IQR beyond the nearer quartile is treated as a criterion for outliers, and it is not unusual to see inclinations to remove any such data points from further analyses. These attitudes don't match the intention behind that convention. Naturally, tools are often used in ways their creators did not intend.

Turning a practical criterion for what is to be displayed separately into a rigid rule for outlier identification is in my view misconceived in principle and, if applied, usually very poor practice. One reason is that some such values are expected in moderate or large samples even from a very well behaved distribution (e.g. a normal or Gaussian). Another, arising more often, is that such data points are entirely to be expected in many skewed distributions, where, far from being rogue points that should be excluded, they are key parts of the data. Yet another is that the context provided by other variables and/or subject-matter knowledge should inform any decision on outliers, even the decision just to flag or label them as such.
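
To put a rough number on the point about well behaved distributions, here is a small sketch that computes the probability of a value falling more than 1.5 IQR beyond the nearer quartile for an exactly normal distribution. It uses the population quartiles of the standard normal, which is an idealization: quartiles estimated from a real sample will vary, so the fraction observed in practice will differ somewhat. The variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal; the result does not depend on mean or sd

# Population quartiles and interquartile range.
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1

# Limits under the 1.5 IQR convention.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Probability that a single value lies beyond either limit.
p_beyond = z.cdf(lower_limit) + (1 - z.cdf(upper_limit))

# Expected number of such points in a sample of 1000 (hypothetical size).
expected_in_1000 = 1000 * p_beyond
```

The limits fall at about $\pm$ 2.7 standard deviations, so `p_beyond` is roughly 0.007. In a sample of 1000 values from a perfectly normal distribution, about 7 points would be expected to be plotted individually without anything being wrong with them.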