Solved – Boxplot with whiskers close to zero

boxplot, skewness

I've made box-and-whisker plots of how fish prices vary in a rural African market (as a way of detecting change in food security). My data are skewed to the left, so my lower whiskers sit right at the bottom of the axis. I tried log-transforming, but it looks even worse. Is this the best way to present my data? Thanks.

[Image: the asker's box-and-whisker plots of fish prices]

Best Answer

Box-plots are an anachronism; use a violin plot instead.

Given the skew of your price data, I would recommend you plot it on a logarithmic scale and use a violin plot. This plot shows a density estimate of the data at each time point, which gives a clearer depiction of the shape of the distribution than the box-plot does. If desired, you can include the quantiles in the violin plot, but this is generally unnecessary, since the shape already gives the viewer a reasonable depiction of the changes in location over time. I would also recommend that when you plot on a logarithmic scale, you still label the axis with the original price values (not their logarithms) and simply place the tick marks at logarithmically spaced values. Here is an example implementation of this kind of plot in R.

#Load libraries and set theme
library(ggplot2);
THEME <- theme(plot.title    = element_text(hjust = 0.5, size = 14, face = 'bold'),
               plot.subtitle = element_text(hjust = 0.5, face = 'bold'));

#Generate mock data set (since you haven't given your data)
#Log-prices are drawn from a normal distribution for each year, so prices are lognormal
set.seed(1);
YEARS    <- c('2002-3', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015');
MU       <- c(1.03, 1.02, 1.28, 1.01, 0.99, 1.04, 1.24, 1.35, 1.29);  #Mean of log-price by year
SIG      <- c(0.61, 0.65, 0.66, 0.62, 0.59, 0.62, 0.63, 0.63, 0.60);  #Std dev of log-price by year
N        <- c(132, 130, 128, 138, 140, 131, 133, 138, 142);           #Sample size by year
LOGPRICE <- unlist(Map(rnorm, n = N, mean = MU, sd = SIG));
DATA  <- data.frame(Year  = rep(YEARS, N),
                    Price = exp(LOGPRICE));

#Generate violin plot of data
FIGURE <- ggplot(data = DATA, aes(x = Year, y = Price)) + 
            geom_violin(fill = 'blue', draw_quantiles = c(0.25, 0.5, 0.75)) + 
            scale_y_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
                labels = scales::trans_format("log10", scales::math_format(10^.x))) +
            expand_limits(y = c(10^(-0.5), 10^(1.5))) + THEME + 
            ggtitle('Fish Price - Katima Mulilo Market') + 
            xlab(NULL) + ylab('Price - $/kg');

#Print the plot
FIGURE;

[Image: violin plots of the mock fish-price data by year, with price on a logarithmic axis]
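The quartiles marked inside the violins can also be tabulated directly if you want the numbers behind the picture. The following is a minimal sketch in base R, assuming mock lognormal prices for three years; the names `years`, `price` and `quartiles` are illustrative and not part of the code above. These are sample quartiles computed straight from the data, so they may differ slightly from the lines drawn on the violins.

```r
# Illustrative mock data (hypothetical values): lognormal prices for three years
set.seed(1)
years <- rep(c("2013", "2014", "2015"), times = c(133, 138, 142))
price <- exp(rnorm(length(years), mean = 1.2, sd = 0.6))

# Sample quartiles of price within each year
# split() groups the prices by year; sapply() returns one column per year,
# so t() is used to get one row per year and one column per quartile
quartiles <- t(sapply(split(price, years), quantile, probs = c(0.25, 0.5, 0.75)))

print(round(quartiles, 2))
```

Each row of `quartiles` is a year and the columns `25%`, `50%` and `75%` hold the lower quartile, median and upper quartile in the original price units.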

You can see that this price data shows up well on a logarithmic scale, which suggests that the variations in price tend to be variations in scale (multiplicative rather than additive). Also note that the vertical axis still shows values in dollars per kilogram: the tick labels are powers of ten, so the axis is logarithmic, but the values remain in the original unit of measurement. This is generally the most useful way to present data of this kind.
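If powers of ten feel unfamiliar to your audience, one possible variation is to keep the logarithmic axis but put the ticks at round price values. This is a sketch only, assuming mock lognormal data and hand-picked tick positions; `mock`, `price_breaks` and `fig_dollars` are hypothetical names, and the break values should be adjusted to the range of the real prices.

```r
library(ggplot2)

# Illustrative mock data (hypothetical values, not the asker's data)
set.seed(1)
mock <- data.frame(
  Year  = rep(c("2013", "2014", "2015"), each = 130),
  Price = exp(rnorm(390, mean = 1.2, sd = 0.6))
)

# Tick positions chosen by hand, in dollars per kilogram (an assumption)
price_breaks <- c(0.5, 1, 2, 5, 10, 20)

# Logarithmic y-axis with ticks at the chosen price values;
# the default labels print the break values in the original units
fig_dollars <- ggplot(mock, aes(x = Year, y = Price)) +
  geom_violin(fill = "blue") +
  scale_y_log10(breaks = price_breaks) +
  xlab(NULL) + ylab("Price - $/kg")

fig_dollars
```

The spacing between ticks is still logarithmic, so equal vertical distances correspond to equal proportional changes in price; only the labelling differs from the powers-of-ten version.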
