The data I'm working with are highly skewed, with the vast majority of data concentrated at 0. It seems really hard to highlight the differences between these kind of distributions:
gamma1 <- rgamma(10000, shape=0.05, rate=1)
gamma2 <- rgamma(10000, shape=0.055, rate=0.98)
gamma3 <- rgamma(10000, shape=0.06, rate=0.95)
c(mean(gamma1), mean(gamma2), mean(gamma3))
[1] 0.04845668 0.05253655 0.05797983
ks.test(gamma1, gamma2)
Two-sample Kolmogorov-Smirnov test
data: gamma1 and gamma2
D = 0.0433, p-value = 1.44e-08
alternative hypothesis: two-sided
ks.test(gamma2, gamma3)
Two-sample Kolmogorov-Smirnov test
data: gamma2 and gamma3
D = 0.0456, p-value = 1.864e-09
alternative hypothesis: two-sided
ks.test(gamma1, gamma3)
Two-sample Kolmogorov-Smirnov test
data: gamma1 and gamma3
D = 0.0798, p-value < 2.2e-16
alternative hypothesis: two-sided
As most of the data is at 0, histograms are not very helpful for seeing the differences between the distributions (not to mention the fact there doesn't seem to be a convenient way to plot a histogram with multiple distributions in R, see https://github.com/hadley/ggplot2/issues/1081):
Violin plots seem to misrepresent the shape of the distributions (they look a lot more normal-like than they really are) and since the means are very low, the boxes are almost invisible:
Since these plot aren't really showing anything useful, I was wondering if there is a better way to visualise skewed distributions?
Best Answer
Note that your gamma data are not at zero, they are near zero, so you can work with logarithms. If your actual data do contain zeros and you only used skewed gammas to create an example here, you could shift your data by a small $\epsilon$ to make them positive and then take logs.
I see two possibilities. You could use beanplots with logged data, aligning them via the
ylim
argument.Or, if you really want to compare your three distributions, you could do pairwise qq plots on log scales:
The beanplot emphasizes the extreme skewness of all three datasets (note the logs), while the pairwise qq plots emphasize that the datasets are different, while losing the skewness.