Interpreting the box plot (also called a box and whisker plot) rests on understanding that it is a graphical representation of a five-number summary, i.e. minimum, 1st quartile, median, 3rd quartile and maximum. The box encompasses the middle 50% of the observations. The ends of the whiskers (the vertical lines emanating from the top and bottom of the box) typically show where the minimum and maximum lie. However, where possible outliers exist (sometimes assessed based on 1.5 $\times$ the interquartile range), they are added as individual points, as is the case for your figure.
It may be useful to look at a histogram or density plot for specific categories of the data, as that may help you understand what the box plot is saying.
@Glen_b rightly indicates that left skew is evident and that the central tendency for the 5th level of strength of feeling is lower than for the others. It is difficult, however, to see whether that difference would be statistically significant.
It is impossible to know without knowing more about how your software draws a box and whisker plot, and it is even more difficult without a numeric scale to anchor the results on. In general, there are a number of different guidelines in this regard. However, we can always resort to reading the documentation:
- boxes: the main body of the boxplot showing the quartiles and the median’s confidence intervals if enabled.
- medians: horizonal lines at the median of each box.
- whiskers: the vertical lines extending to the most extreme, n-outlier data points.
- caps: the horizontal lines at the ends of the whiskers.
- fliers: points representing data that extend beyone (sic) the whiskers (outliers).
Given the values of 16.5, 17.14, 13.5, and 16.75, the value of 13.5 is being treated as a 'flier'. The boxes stretch from Q1 to Q3. The horizontal line is the median (aka Q2). There are a number of different approaches to calculating these values exactly, but I'll just grab the handy values from R (quantile defaults): 15.75 for Q1, 16.625 for Q2, and 16.8475 for Q3. Although the documentation cited above is unclear, it appears that the whiskers and caps extend to the "most extreme, n-outlier data points" (presumably "non-outlier"), that is, excluding the 'fliers' (more on this later). Therefore, we can expect them to extend from 16.50 to 17.14. That is, the bottom whisker will end at a value closer to the median than Q1, and the top whisker will end slightly beyond Q3... which is exactly what we see.
However, given the circular definition of whiskers and fliers, you have to look further up in the docs to see that whiskers are "a function of the inner quartile range. They extend to the most extreme data point within ( whis*(75%-25%) ) data range" where 'whis' has a default of 1.5. Combining these sources of information, we can see that the whiskers may reach up to 1.5 times the interquartile range, but they stop at the most extreme data point inside that range. Data points beyond that range are dubbed fliers and plotted as such.
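As a rough check of the numbers above, here is a minimal Python sketch. It assumes that the quoted range is measured outward from Q1 and Q3 (the docs do not say so explicitly), and it uses the standard library's `statistics.quantiles` with `method="inclusive"`, which reproduces the R values quoted above for this sample. It does not call the plotting library itself, whose exact rules may differ by version. The variable names are mine.

```python
from statistics import quantiles

values = [16.5, 17.14, 13.5, 16.75]
whis = 1.5  # default multiplier quoted from the docs

# Quartiles by linear interpolation between order statistics.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed reading of the rule: fences sit whis * IQR outside the box.
lo_fence = q1 - whis * iqr
hi_fence = q3 + whis * iqr

# Whiskers stop at the most extreme data points inside the fences.
inside = [v for v in values if lo_fence <= v <= hi_fence]
whisker_lo, whisker_hi = min(inside), max(inside)

# Anything outside the fences is drawn separately as a 'flier'.
fliers = [v for v in values if v < lo_fence or v > hi_fence]

print(q1, q2, q3)
print(whisker_lo, whisker_hi)  # 16.5 17.14
print(fliers)                  # [13.5]
```

Under these assumptions the lower fence falls at about 14.1, so 13.5 lies outside it and the bottom whisker ends at 16.5, inside the box.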
So, in response to the second question: it is 'valid'. It isn't my preferred way of seeing boxplots drawn, but that doesn't make it invalid. As I mentioned, there is no one convention in this regard. So long as you know what the boxplot is meant to draw, and it draws it that way, it is at least reliable. Whether it is valid is a value judgement you have to make for yourself.
My descriptions above, plus the docs, should help you interpret your boxplot, but just in case:
- Central Line: Median
- Edges of Boxes: Q1 and Q3
- Limits of Whiskers: The minimum and maximum values inside the inflated inter-quartile range (e.g. whis*(75%-25%) where whis defaults to 1.5)
- Little plus signs: 'fliers', data-points beyond the limits of the whiskers
Best Answer
The question admits confusion and is itself contradictory. If whiskers extend to the extremes, then how can there be plotted data points beyond them?
Let's back up. (Incidentally, the term flier is, I think, not in general statistical use, and I won't try to guess what it might mean, except presumably as a loose synonym for outlier.)
There is not a universal definition of box plots. Nor do published box plots always, or even often, come with an explanation of the precise rules used in their construction.
So much for the bad news.
In general, the box of a box plot shows the median and quartiles. A whisker is drawn outwards from each quartile and data points beyond the whiskers are shown individually. Occasionally, box plots are decorated with other summaries or details, such as point symbols showing means or labelling for selected outliers (this big river is "Amazon", or whatever).
Whiskers could extend:
1. All the way to the maximum or minimum, in which case no data values are plotted as points beyond.
2. As defined by a convention used by Tukey, J.W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley. The interquartile range IQR is first calculated as IQR = upper quartile $-$ lower quartile. Tukey himself at various points used terms such as hinges and fourths for (approximate) quartiles, but these terms are slowly fading away; most recent treatments have reverted to the earlier term quartiles. There are several conventions for exactly how quartiles are calculated, which I will treat as small print here.
   - One whisker is drawn from the lower quartile down to the lowest data point $\ge$ the lower quartile $-$ 1.5 IQR.
   - The other whisker is drawn from the upper quartile up to the highest data point $\le$ the upper quartile $+$ 1.5 IQR.
   - Points beyond either whisker are plotted separately. These are often dubbed outliers, but keep reading for some opinionated discussion.

   Note that in principle a whisker could be of zero length and so not visible as such, as when for example the minimum and the lower quartile are identical, which is quite common with certain kinds of data. Example: counted data including zeros in which at least 25% of the values are zero.

   Some programs allow you to specify multipliers other than 1.5. Tukey himself before 1977 used multipliers of 1 and/or 2.
3. Other conventions exist, such as drawing whiskers to particular paired percentiles such as those for 1 and 99%, 5 and 95% and 10 and 90%. Usually, data points beyond those percentiles are shown individually as with #2. My personal view is that these conventions are easier to explain than #2.
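The three conventions listed above can be sketched in Python as follows. This is an illustration only: the function name `whisker_ends` and its parameters `rule`, `k` and `tail_pct` are hypothetical, and the quartile rule (`method="inclusive"` in the standard library) is just one of the several conventions treated as small print above. Note that under the percentile rule the whisker ends may be interpolated values rather than actual data points.

```python
from statistics import quantiles


def whisker_ends(data, rule="tukey", k=1.5, tail_pct=5):
    """Return (low, high) whisker ends under one of three conventions.

    rule="minmax":     whiskers reach the minimum and maximum.
    rule="tukey":      whiskers reach the most extreme data points
                       within k * IQR of the nearer quartile.
    rule="percentile": whiskers reach paired percentiles, e.g. 5 and 95.
    """
    xs = sorted(data)
    if rule == "minmax":
        return xs[0], xs[-1]
    if rule == "tukey":
        q1, _, q3 = quantiles(xs, n=4, method="inclusive")
        iqr = q3 - q1
        low = min(x for x in xs if x >= q1 - k * iqr)
        high = max(x for x in xs if x <= q3 + k * iqr)
        return low, high
    if rule == "percentile":
        # 99 cut points; index i - 1 holds the i-th percentile.
        cuts = quantiles(xs, n=100, method="inclusive")
        return cuts[tail_pct - 1], cuts[100 - tail_pct - 1]
    raise ValueError(f"unknown rule: {rule}")


# Made-up counted data in which more than 25% of the values are zero.
counts = [0, 0, 0, 0, 1, 1, 2, 3, 5, 12]

print(whisker_ends(counts, "minmax"))      # (0, 12)
print(whisker_ends(counts, "tukey"))       # (0, 5)
print(whisker_ends(counts, "percentile"))
```

With these made-up counts the lower quartile and the minimum are both zero, so under the Tukey rule the lower whisker has zero length, and the value 12 would be plotted separately.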
I would assert that box plots do not show confidence intervals at all, in so far as any display of confidence intervals is usually called something else, or just explained as such.
On outliers, here is one opinion. Others may be found.
Tukey's intention was essentially that the researcher would think about how to handle data points plotted individually, and (for example) that a straggly box plot with outliers might point to analysis on a transformed scale.
It is quite common to see discussions in which lying beyond the nearer quartile $\pm$ 1.5 IQR is treated as a criterion for outliers, and not unusual to see inclinations to remove any such data points from further analyses. These attitudes don't match the intention behind that convention. Naturally, tools are often used in ways their creators did not intend.
Making a practical criterion for what is to be displayed separately into a rigid rule for outlier identification is in my view misconceived in principle and, if applied, usually very poor practice. One reason is that some such values are expected in moderate or large samples even from a very well behaved distribution (e.g. a normal or Gaussian). Another is that, more often, such data points are entirely to be expected in many skewed distributions. Far from being rogue points that should be excluded, such points are key parts of the data. Yet another reason is that the context provided by other variables and/or subject-matter knowledge should inform any decision on outliers, even the decision just to flag or label them as such.
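To put rough numbers on the point that such values are to be expected, here is a small Python sketch. It computes the population proportion lying beyond quartile $\pm$ 1.5 IQR for a standard normal and for an exponential distribution with rate 1, the latter chosen by me as a simple example of a skewed distribution. It uses exact population quartiles, so proportions in real samples, where quartiles are estimated, will differ somewhat.

```python
from math import exp, log
from statistics import NormalDist

k = 1.5

# Standard normal: quartiles are symmetric about zero.
z = NormalDist()
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
p_normal = z.cdf(q1 - k * iqr) + (1 - z.cdf(q3 + k * iqr))

# Exponential with rate 1: the quantile function is -log(1 - p).
e_q1, e_q3 = -log(0.75), -log(0.25)
e_iqr = e_q3 - e_q1
lo = e_q1 - k * e_iqr  # negative here, so no values can fall below it
hi = e_q3 + k * e_iqr
p_expon = exp(-hi) + (1 - exp(-max(lo, 0.0)))

print(f"normal:      {p_normal:.4f}")
print(f"exponential: {p_expon:.4f}")
```

The proportions come out at roughly 0.7% for the normal and roughly 4.8% for the exponential, so in a sample of 1,000 one would expect several such points from a normal and dozens from an exponential, none of them rogue.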