Solved – Plotting summary statistics with mean, sd, min and max

Tags: boxplot, data visualization, r

I come from an economics background, where summary statistics of variables are usually reported in a table. However, I would like to plot them.

I could modify a box plot to allow it to display the mean, standard deviation, minimum and maximum but I don't wish to do so as box plots are traditionally used to display medians and Q1 and Q3.

All of my variables have different scales. It would be great if somebody could suggest a meaningful way by which I could plot these summary statistics. I can work with R or Stata.

Best Answer

There is a reason why Tukey's boxplot is universal: it can be applied to data from many different distributions, from Gaussian to Poisson. The median, MAD (median absolute deviation) and IQR (interquartile range) are robust measures that remain informative when the data deviate from normality. The mean and SD, by contrast, are sensitive to outliers and should be interpreted with respect to the underlying distribution. The solution below is therefore more suitable for normal or log-normal data. You may browse through a selection of robust measures here, and explore the WRS R package here.
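To make the robustness point concrete, here is a minimal sketch in base R. The sample values and the helper name `summarise_both` are made up for illustration; the idea is just to add one extreme value and see which statistics move.

```r
# Hypothetical sample, and the same sample with one extreme value added
x     <- c(2, 4, 4, 5, 6, 7, 9)
x_out <- c(x, 100)

# Hypothetical helper: classical and robust summaries side by side
summarise_both <- function(v) {
  c(mean = mean(v), sd = sd(v),
    median = median(v), mad = mad(v), iqr = IQR(v))
}

comparison <- rbind(clean        = summarise_both(x),
                    with_outlier = summarise_both(x_out))
print(round(comparison, 2))
```

The mean and SD columns change substantially between the two rows, while the median, MAD and IQR stay close to their original values.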

# simulating dataset
set.seed(12)
d1 <- rnorm(100, sd=30)
d2 <- rnorm(100, sd=10)
d <- data.frame(value=c(d1,d2), condition=rep(c("A","B"),each=100))

# function to produce summary statistics (mean and +/- sd), as required for ggplot2
data_summary <- function(x) {
   mu <- mean(x)
   sigma1 <- mu-sd(x)
   sigma2 <- mu+sd(x)
   return(c(y=mu,ymin=sigma1,ymax=sigma2))
}
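As a quick sanity check of what this helper returns, and of the tabular form the question mentions, the sketch below applies it to a toy vector and then builds a per-condition table of mean, sd, min and max with `aggregate()`. The toy data frame `toy_df` is an assumption for illustration, not the simulated data above.

```r
# Same helper as above, repeated so this snippet runs on its own
data_summary <- function(x) {
  mu <- mean(x)
  c(y = mu, ymin = mu - sd(x), ymax = mu + sd(x))
}

# Toy vector with mean 4 and sd 2, so the band runs from 2 to 6
toy <- data_summary(c(2, 4, 6))
print(toy)

# Hypothetical toy data with two conditions on different scales
toy_df <- data.frame(value     = c(2, 4, 6, 10, 20, 30),
                     condition = rep(c("A", "B"), each = 3))

# One row per condition holding all four statistics
stats <- aggregate(value ~ condition, data = toy_df,
                   FUN = function(x) c(mean = mean(x), sd = sd(x),
                                       min = min(x), max = max(x)))
stats <- do.call(data.frame, stats)  # flatten the matrix column
print(stats)
```

The resulting table is the usual economics-style summary; the plots are simply a graphical view of the same numbers.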

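library(ggplot2)
# box limits are the minimum and maximum; data_summary supplies the mean and mean +/- sd
# (ggplot2 >= 3.3.0 uses fun, fun.min and fun.max in place of fun.y, fun.ymin and fun.ymax)
ggplot(data=d, aes(x=condition, y=value, fill=condition)) + 
  geom_crossbar(stat="summary", fun=data_summary, fun.max=max, fun.min=min)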

Additionally, adding + geom_jitter() or + geom_point() to the code above overlays the raw data values on the summary.
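A self-contained sketch of such an overlay might look like the following. It is a variation rather than the exact code above: the helper `mean_sd` is a hypothetical name, it returns a data frame for `fun.data`, and the crossbar here spans mean ± SD rather than min to max. The faceted version at the end is one possible way to handle variables on different scales, assuming each variable gets its own panel.

```r
library(ggplot2)

# Same simulated data as above
set.seed(12)
d <- data.frame(
  value     = c(rnorm(100, sd = 30), rnorm(100, sd = 10)),
  condition = rep(c("A", "B"), each = 100)
)

# Hypothetical helper: mean with a +/- 1 SD band, as a data frame
mean_sd <- function(x) {
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

# Raw points first, then the summary drawn on top
p <- ggplot(d, aes(x = condition, y = value, colour = condition)) +
  geom_jitter(width = 0.15, alpha = 0.4) +
  stat_summary(fun.data = mean_sd, geom = "crossbar", width = 0.5)

# Optional: one panel per condition, each with its own axis range
p_free <- p + facet_wrap(~ condition, scales = "free")
```

Printing `p` or `p_free` draws the plot. Jittering only horizontally (the `width` argument) keeps the plotted values on the y axis exact.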


Thanks to @Roland for pointing out the violin plot. It has the advantage of showing the probability density alongside the summary statistics:

# require(ggplot2)
ggplot(data=d, aes(x=condition, y=value, fill=condition)) + 
geom_violin() + stat_summary(fun.data=data_summary)

Both examples are shown below.

[Image: the crossbar plot and the violin plot produced by the code above]