[Tex/LaTex] pgfplots generating incorrect boxplots

boxplotpgfplots

While trying to reproduce some boxplots using pgfplots (and analyzing samples automatically), I noticed that in some cases I was getting different plots from Matlab. I have left all the settings in default, which means that the thresholds for outliers and whiskers should be the same in both cases. (Thresholds for outliers by default are q1-1.5*(q3-q1) and q3+1.5*(q3-q1) where q1 and q3 are the 25th and the 75th percentiles respectively). However, Matlab and pgfplots generate plots with different whisker and outlier positions.

Plot from pgfplots:

Boxplot from pgfplots

Plot from Matlab:
Boxplot from Matlab

Following is the code and the data file that I am using to generate these plots:

Latex:

\documentclass{minimal}
\usepackage{pgfplots}
\usepgfplotslibrary{statistics}
\pgfplotsset{compat=1.8}

\begin{document}
\begin{tikzpicture}
  \begin{axis}[
    boxplot/draw direction = y,
    xmin = -2,
    xmax = 4]
    \addplot+[boxplot] table[y=Var1]{testdata.txt};
  \end{axis}
\end{tikzpicture}
\end{document}

Matlab:

T=readtable('testdata.txt','delimiter','\t');
boxplot(T{:,:})

The data file:

Var1
0
1
2
3
4
5
3
10
14
-9

Any idea why they generate different plots? Which one is correct?

Best Answer

There are at least two different factors potentially contributing to differences in the boxplots produced by Matlab and pgfplots.

1. <= and >= (Matlab) vs < and > (pgfplots)

There is a difference in the definitions of whiskers and outliers.

From the manual of pgfplots (I have emphasized the key fact):

lower whisker is the smallest data value which is larger than lower quartile−1.5 · IQR

and

upper whisker is the largest data value which is smaller than upper quartile+1.5 · IQR

From the manual of Matlab (emphasis added):

Points are drawn as outliers if they are larger than Q3+W*(Q3-Q1) or smaller than Q1-W*(Q3-Q1), where Q1 and Q3 are the 25th and 75th percentiles, respectively. The default value 1.5 corresponds to approximately +/- 2.7 sigma and 99.3 coverage if the data are normally distributed. The plotted whisker extends to the adjacent value, which is the most extreme data value that is not an outlier.

2. Different methods for computing box limits / quartiles

It would be all too easy to say that the box limits are the 1st and 3rd quartile (a.k.a. 25th and 75th percentile, a.k.a. quantiles with probabilities 0.25 and 0.75) and leave it at that. Alas, there are many methods for computing quantiles. Without going into too much detail, there are no less than 9 different quantile() method variants in R. For the example data set, these methods give 7 unique results for the pair of numbers (25th and 75th percentile). These are:

  • 0 and 5
  • 0.5 and 4.5
  • 0.75 and 6.25
  • 0.9166667 and 5.4166667
  • 0.9375 and 5.3125
  • 1 and 5
  • 1.25 and 4.75

Matlab finds the box limits to be 1 and 5. According to @Jake (see comments), the limits in pgfplots are 1.5 and 4.5, which can also be approximately confirmed by looking at the picture attached by the original poster. Note that this corresponds to yet another definition of quantile. The pgfplots manual, Revision 1.11 (2014/08/04), states the following about the computation of quantiles:

pgfplots manual, about quantiles

I am not sure how exactly this maps the quartiles of the example data set to 1.5 and 4.5. We have x1=-9, x2=0, x3=1, x4=2, x5=x6=3, x7=4, x8=5, x9=10 and x10=14. Following the formula in the pgfplots manual, we get lower quartile 0.5*(x2+x3)=0.5*(0+1)=0.5 and upper quartile 0.5*(x7+x8)=0.5*(4+5)=4.5. One consequence of using the formula presented in the pgfplots manual is that for computing small quantiles, one would need the non-existent value x0.

How this works with the example data

Assuming what I wrote above is correct and everything works as documented, we can work through a boxplot of the example data from the perspective of both Matlab and pgfplots.

Matlab

Assuming that the quartiles are 1 and 5, the inter-quartile range (IQR) is 4. Now 1.5*IQR is 6. Third quartile + 1.5*IQR is 11. Data value 14 is larger than that which makes it an outlier. However, 10 is not an outlier. Thus the upper whisker extends to 10.

pgfplots

Assuming that the quartiles are 1.5 and 4.5, the inter-quartile range (IQR) is 3. Now 1.5*IQR is 4.5. Third quartile + 1.5*IQR is 9. Data value 14 is larger than that which makes it an outlier. Also 10 is an outlier. 5 is the largest number which is not an outlier. Thus the upper whisker extends to 5.

Conclusion

I cannot tell for sure which of the boxplot computation methods is better or correct. I can add another data point: In case of the example data, boxplot() in R gives the same result as Matlab.

One should also note that due to the limitations of computer arithmetics and the discrete nature of the whisker locations, different implementations using the same formula may also produce a different result depending on the data set.

Related Question