Solved – Comparing and visualising highly skewed distributions

data visualizationhistogramskewness

The data I'm working with are highly skewed, with the vast majority of data concentrated at 0. It seems really hard to highlight the differences between these kind of distributions:

gamma1 <- rgamma(10000, shape=0.05, rate=1)
gamma2 <- rgamma(10000, shape=0.055, rate=0.98)
gamma3 <- rgamma(10000, shape=0.06, rate=0.95)

c(mean(gamma1), mean(gamma2), mean(gamma3))

[1] 0.04845668 0.05253655 0.05797983

ks.test(gamma1, gamma2)

 Two-sample Kolmogorov-Smirnov test

  data:  gamma1 and gamma2  
  D = 0.0433, p-value = 1.44e-08  
  alternative hypothesis: two-sided

ks.test(gamma2, gamma3)

 Two-sample Kolmogorov-Smirnov test

 data:  gamma2 and gamma3  
 D = 0.0456, p-value = 1.864e-09  
 alternative hypothesis: two-sided  

ks.test(gamma1, gamma3)

Two-sample Kolmogorov-Smirnov test

 data:  gamma1 and gamma3  
 D = 0.0798, p-value < 2.2e-16  
 alternative hypothesis: two-sided

As most of the data is at 0, histograms are not very helpful for seeing the differences between the distributions (not to mention the fact there doesn't seem to be a convenient way to plot a histogram with multiple distributions in R, see https://github.com/hadley/ggplot2/issues/1081):

Histograms

Violin plots seem to misrepresent the shape of the distributions (they look a lot more normal-like than they really are) and since the means are very low, the boxes are almost invisible:

Violin plots

Since these plot aren't really showing anything useful, I was wondering if there is a better way to visualise skewed distributions?

Best Answer

Note that your gamma data are not at zero, they are near zero, so you can work with logarithms. If your actual data do contain zeros and you only used skewed gammas to create an example here, you could shift your data by a small $\epsilon$ to make them positive and then take logs.

set.seed(1)
gamma1 <- rgamma(10000, shape=0.05, rate=1)
gamma2 <- rgamma(10000, shape=0.055, rate=0.98)
gamma3 <- rgamma(10000, shape=0.06, rate=0.95)

I see two possibilities. You could use beanplots with logged data, aligning them via the ylim argument.

library(beanplot)
opar <- par(mfrow=c(1,3))
    beanplot(gamma1,what=c(1,1,0,0),log="y",ylim=range(c(gamma1,gamma2,gamma3)),col="grey")
    beanplot(gamma2,what=c(1,1,0,0),log="y",ylim=range(c(gamma1,gamma2,gamma3)),col="grey")
    beanplot(gamma3,what=c(1,1,0,0),log="y",ylim=range(c(gamma1,gamma2,gamma3)),col="grey")
par(opar)

beanplot

Or, if you really want to compare your three distributions, you could do pairwise qq plots on log scales:

pairs(cbind(sort(gamma1),sort(gamma2),sort(gamma3)),log="xy",
  panel=function(x,y,...){points(x,y,pch=19,cex=0.2,...); abline(a=0,b=1)})

pairwise qq plots

The beanplot emphasizes the extreme skewness of all three datasets (note the logs), while the pairwise qq plots emphasize that the datasets are different, while losing the skewness.