There are at least two different factors potentially contributing to differences in the boxplots produced by Matlab and pgfplots.
1. <= and >= (Matlab) vs < and > (pgfplots)
There is a difference in the definitions of whiskers and outliers.
From the manual of pgfplots (I have emphasized the key fact):
lower whisker is the smallest data value which is larger than lower quartile−1.5 · IQR
and
upper whisker is the largest data value which is smaller than upper quartile+1.5 · IQR
From the manual of Matlab (emphasis added):
Points are drawn as outliers if they are larger than Q3+W*(Q3-Q1) or
smaller than Q1-W*(Q3-Q1), where Q1 and Q3 are the 25th and 75th
percentiles, respectively. The default value 1.5 corresponds to
approximately +/- 2.7 sigma and 99.3 coverage if the data are normally
distributed. The plotted whisker extends to the adjacent value, which
is the most extreme data value that is not an outlier.
In other words, a data value that lies exactly on the limit Q1-1.5*IQR or Q3+1.5*IQR is not an outlier in Matlab and can therefore be a whisker end, whereas pgfplots requires the whisker value to lie strictly inside these limits.
2. Different methods for computing box limits / quartiles
It would be all too easy to say that the box limits are the 1st and 3rd quartile (a.k.a. 25th and 75th percentile, a.k.a. quantiles with probabilities 0.25 and 0.75) and leave it at that. Alas, there are many methods for computing quantiles. Without going into too much detail, the quantile() function in R alone offers no fewer than 9 method variants. For the example data set, these methods give 7 unique results for the pair of numbers (25th and 75th percentile). These are:
- 0 and 5
- 0.5 and 4.5
- 0.75 and 6.25
- 0.9166667 and 5.4166667
- 0.9375 and 5.3125
- 1 and 5
- 1.25 and 4.75
Matlab finds the box limits to be 1 and 5. According to @Jake (see comments), the limits in pgfplots are 1.5 and 4.5, which can also be approximately confirmed by looking at the picture attached by the original poster. Note that this corresponds to yet another definition of quantile. The pgfplots manual, Revision 1.11 (2014/08/04), states the following about the computation of quantiles:
I am not sure how exactly this maps the quartiles of the example data set to 1.5 and 4.5. We have x1=-9, x2=0, x3=1, x4=2, x5=x6=3, x7=4, x8=5, x9=10 and x10=14. Following the formula in the pgfplots manual, we get lower quartile 0.5*(x2+x3)=0.5*(0+1)=0.5 and upper quartile 0.5*(x7+x8)=0.5*(4+5)=4.5. One consequence of using the formula presented in the pgfplots manual is that for computing small quantiles, one would need the non-existent value x0.
How this works with the example data
Assuming what I wrote above is correct and everything works as documented, we can work through a boxplot of the example data from the perspective of both Matlab and pgfplots.
Matlab
Assuming that the quartiles are 1 and 5, the interquartile range (IQR) is 4, so 1.5*IQR is 6 and third quartile + 1.5*IQR is 11. The data value 14 is larger than that, which makes it an outlier. However, 10 is not an outlier. Thus the upper whisker extends to 10.
pgfplots
Assuming that the quartiles are 1.5 and 4.5, the interquartile range (IQR) is 3, so 1.5*IQR is 4.5 and third quartile + 1.5*IQR is 9. The data value 14 is larger than that, which makes it an outlier. The value 10 is an outlier as well. The largest value that is not an outlier is 5. Thus the upper whisker extends to 5.
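The two calculations can be put side by side, including the lower limit, which is not spelled out above. This is only a restatement under the same assumptions (quartiles 1 and 5 for Matlab, 1.5 and 4.5 for pgfplots, whisker factor 1.5); the interval notation is mine, and the block needs amsmath:

```latex
\begin{align*}
\text{Matlab } (Q_1 = 1,\ Q_3 = 5):\quad
  & [\,1 - 1.5 \cdot 4,\ 5 + 1.5 \cdot 4\,] = [-5,\ 11] \\
\text{pgfplots } (Q_1 = 1.5,\ Q_3 = 4.5):\quad
  & [\,1.5 - 1.5 \cdot 3,\ 4.5 + 1.5 \cdot 3\,] = [-3,\ 9]
\end{align*}
```

In both cases -9 falls below the lower limit, so the lower whisker ends at 0 either way. Only the upper whisker differs (10 versus 5).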
Conclusion
I cannot tell for sure which of the boxplot computation methods is better or correct. I can add another data point: for the example data, boxplot() in R gives the same result as Matlab.
One should also note that, due to the limitations of floating-point arithmetic and the discrete nature of the whisker locations, different implementations of the same formula may still produce different results for some data sets.
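If the Matlab-style box is wanted in pgfplots, one possible workaround is to skip pgfplots' own computation and enter the statistics by hand with boxplot prepared. The following is a minimal sketch, not a tested recipe. The quartiles 1 and 5 and the upper whisker 10 are the values discussed above. The median 3, the lower whisker 0 and the outlier -9 are my own derivation from the example data. The automatically computed box may look different depending on the pgfplots version.

```latex
\documentclass{article}
\usepackage{pgfplots}
\pgfplotsset{compat=1.11}
\usepgfplotslibrary{statistics}
\begin{document}
\begin{tikzpicture}
  \begin{axis}[ytick={1,2}, yticklabels={pgfplots, Matlab-like}]
    % Box plot computed by pgfplots from the raw example data
    \addplot+[boxplot] table[row sep=\\, y index=0] {
      data\\ -9\\ 0\\ 1\\ 2\\ 3\\ 3\\ 4\\ 5\\ 10\\ 14\\
    };
    % Box plot with hand-entered statistics (assumed Matlab values);
    % the table rows are drawn as outliers (-9 and 14)
    \addplot+[boxplot prepared={
        lower whisker=0, lower quartile=1, median=3,
        upper quartile=5, upper whisker=10
      }] table[row sep=\\, y index=0] {
      -9\\ 14\\
    };
  \end{axis}
\end{tikzpicture}
\end{document}
```

Drawing both boxes in one axis makes the difference in the upper whisker directly visible.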
After some trying I found the solution!
(I already had the correct idea in mind but this didn't work in the first run...)
Recap: A linear function has the form y = a · x + b.
After a linear regression plot, pgfplotstable lets you access a with \pgfplotstableregressiona and b with \pgfplotstableregressionb.
So I used this code to compute the y value at xmax (\xmin was defined earlier as the Julian day number of the first table row; see the MWE below):
% get xmax (e.g. 2021-11-25)
\pgfkeysgetvalue{/pgfplots/xmax}{\xmax}
% convert xmax (date to julian, which is an integer number)
\newcount\xmaxjulian
\pgfcalendardatetojulian{\xmax}{\xmaxjulian}
\xmaxjulian=\numexpr\xmaxjulian-\xmin % remove offset
% store y value at xmax in \var
\pgfmathsetmacro{\var}{\pgfplotstableregressiona * (\the\xmaxjulian) + \pgfplotstableregressionb}
% print legend:
\addlegendentry{Forecast utilization in one year:
\luaexec{ tex.sprint ( string.format ( "\%.2f" , \var ) ) } TB}
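Written as a formula, the snippet evaluates the fitted line at the right-hand axis limit. The symbols J(d) for the Julian day number of a date d and J_0 for the offset stored in \xmin are notation introduced here only for illustration:

```latex
\[
  y_{\mathrm{forecast}} = a \cdot \bigl( J(x_{\max}) - J_0 \bigr) + b
\]
```

The offset J_0 has to be the same one that was subtracted when the regression column was built. Otherwise a and b refer to a different x scale.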
Note:
The code \luaexec{ tex.sprint ( string.format ( "\%.2f" , \var ) ) } rounds the value to two decimal places. It only works with LuaLaTeX, and you need to add \usepackage{luacode} to your preamble.
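If LuaLaTeX is not an option, the same rounding can probably be done with the number printer that ships with PGF. This is a sketch of an alternative, not what the code above uses. It assumes \var already holds the value computed with \pgfmathsetmacro:

```latex
% Engine-independent alternative to the \luaexec call (also works with pdfLaTeX).
% "fixed zerofill" pads with zeros so that two decimals are always printed.
\addlegendentry{Forecast utilization in one year:
  \pgfmathprintnumber[fixed, fixed zerofill, precision=2]{\var} TB}
```

Unlike string.format, \pgfmathprintnumber follows the /pgf/number format settings, so the decimal and thousands separators can be changed in one place.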
Note 2:
You can print the function of the linear regression with this:
Formula of linear regression:
$\pgfmathprintnumber{\pgfplotstableregressiona}\cdot x
\pgfmathprintnumber[print sign]{\pgfplotstableregressionb}$
Note 3:
Here is an MWE (you need LuaLaTeX to compile it):
\documentclass{article}
\usepackage[letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm]{geometry}
\usepackage{datatool}
\usepackage{luacode}
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\usepackage{pgfcalendar}
\usepgfplotslibrary{dateplot}
\pgfplotsset{compat=newest}
\begin{filecontents*}{data.csv}
date, size
2021-04-01, 1.42
2021-05-01, 1.46
2021-06-01, 1.58
2021-07-01, 1.55
2021-08-01, 1.69
\end{filecontents*}
\begin{document}
\thispagestyle{empty}
\centering
\pgfplotstableread[col sep=comma]{data.csv}\loadedtable
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% add new column with Julian integer numbers
\newcount\julianday
\pgfplotstablecreatecol[
create col/assign/.code={
\pgfcalendardatetojulian{\thisrowno{0}}{\julianday}
\edef\entry{\the\julianday}
\pgfkeyslet{/pgfplots/table/create col/next content}\entry
},
]{JulianDay}{\loadedtable}
\pgfplotstablegetelem{0}{JulianDay}\of{\loadedtable}
\pgfmathtruncatemacro{\xmin}{\pgfplotsretval}
\pgfplotstablecreatecol[
expr={\thisrow{JulianDay}-\xmin},
]{JulianDayMod}{\loadedtable}
% source: https://tex.stackexchange.com/questions/367339/linear-regression-with-dates-on-x-axis-in-pgfplots
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% compute dates for xmin and xmax
% get last date in file
\DTLloaddb[%
noheader,%
keys={date,2}%
]{myDB}{data.csv}
\DTLforeach*{myDB}{\CurrentA=date}{%
\xdef\LastDate{\CurrentA}
}
\newcount\DateOfLastScan
\pgfcalendardatetojulian{\LastDate{}}{\DateOfLastScan}
\pgfcalendarjuliantodate{\DateOfLastScan}{\theyear}{\themonth}{\theday}
\year=\theyear
\month=\themonth
\day=\theday
% get date two years ago
\year=\numexpr\year-2
\edef\twoyearsago{\the\year-\ifnum\the\month<10 0\fi\the\month-\ifnum\the\day<10 0\fi\the\day}
\year=\numexpr\year+2
% get date one year in the future
\year=\numexpr\year+1
\edef\oneyearfuture{\the\year-\ifnum\the\month<10 0\fi\the\month-\ifnum\the\day<10 0\fi\the\day}
\year=\numexpr\year-1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% print plot
\begin{tikzpicture}
\begin{axis}
[
date coordinates in=x,
ylabel={Terabyte},
xticklabel style={rotate=90,anchor=near xticklabel},
xticklabel=\scriptsize\texttt{\day-\month-\year},
yticklabel={\luaexec{ tex.sprint ( string.format ( "\%.2f" , \tick ) ) }},
grid,
tick align=inside,
width=\textwidth,
xmin=\twoyearsago{},
xmax=\oneyearfuture{},
ymin=-0.05,
legend pos = south east,
legend image post style={only marks, mark=none},
legend cell align={left},
]
% linear regression
\addplot [line width=10pt, opacity=.3, red, shorten >= -10cm, shorten <= -10cm] table [
x index=0,
% now we can use the newly created column to do the linear regression
y={create col/linear regression={
x=JulianDayMod,
y=size,
}}
] {\loadedtable};
% contents from CSV file
\addplot[thick, no marks, solid] table[col sep=comma, x index=0, y index=1]{\loadedtable};
% get y value at xmax
\pgfkeysgetvalue{/pgfplots/xmax}{\xmax}
\newcount\xmaxjulian
\pgfcalendardatetojulian{\xmax}{\xmaxjulian}
\xmaxjulian=\numexpr\xmaxjulian-\xmin
\pgfmathsetmacro{\var}{\pgfplotstableregressiona * (\the\xmaxjulian) + \pgfplotstableregressionb}
% add legends
\addlegendentry{Forecast utilization in one year: \textbf{\luaexec{%
tex.sprint ( string.format ( "\%.2f" , \var ) ) } TB}}
\addlegendentry{Formula of linear regression:
$\pgfmathprintnumber{\pgfplotstableregressiona}\cdot x
\pgfmathprintnumber[print sign]{\pgfplotstableregressionb}$
}
% add invisible point at end of linear regression to keep the end of line inside of the plot
\addplot[forget plot,draw=none] coordinates {(\xmax,\var)};
\end{axis}
\end{tikzpicture}
\end{document}
Best Answer
To prevent PGFPlots from adding padding around your data, set enlargelimits=false: