Solved – Compare distributions with uneven and small sample sizes

distributions, r, sample-size

I will be performing my analysis in R but any general mathematical-based answers will be appreciated.

I realize similar questions have been asked before, but I am new to statistics, so please excuse my ignorance. I do not know enough to be sure that the approaches used in the other threads I viewed are valid for my case, so I would like to check.

Here is the context, simplified for those unfamiliar with this kind of data: I am looking at gene expression data. Essentially, I have a number of categories that each have a set of genes in them. Each gene has an associated expression value (amount present in my sample, log2 transformed). I also have three different test conditions (these conditions are completely independent of one another!). What this amounts to is a distribution. As an example:

Biological Process A:

  • Contains X genes and their expression levels for Condition 1
  • Contains Y genes and their expression levels for Condition 2
  • Contains Z genes and their expression levels for Condition 3

[Figure: example showing my data setup]

I would like to compare these three distributions. By compare I mean that I want to see if there are any large shifts in either the shape or spread of the distribution. EDIT: as Peter has pointed out, what I am really interested in is the 1) location, 2) spread and 3) shape of the three distributions under comparison. For example, if Condition 1 moved such that it was almost entirely below the x-axis and Condition 2 was entirely above it, that would be a very interesting shift. Likewise, if two conditions went above the axis and one went below it, that would also be an interesting shift. Essentially, I need a way to compare the distributions and find the biggest changes in how much of the density lies above or below the x-axis.

I have more than 400 "Biological Processes". The problem is that for some of them the sample sizes vary considerably (n = 3, 25 and 10 genes for the three conditions, respectively, would be an extreme example in my case), so I do not think simply comparing the means is a fair approach. Visual comparison would take a very long time with 400 beanplots to look at, and I would also like some statistics to back up the conclusions I may be making. Essentially, I want to be able to say something like "Treating tissue with Condition 1 had the effect of increasing expression for most genes known to be involved in Biological Process A".

Additionally, if this approach is invalid or there is a better way to look at this data I would be open to suggestions. Thank you.

Best Answer

You ask:

"I would like to compare these three distributions. By compare I mean that I want to see if there are any large shifts in either the shape or spread of the distribution."

and your question implies you are also interested in a shift of location.

Then there are really three things you want to compare: 1) location, 2) spread and 3) shape.

For 1), you can compare means, medians or other quantiles. There are statistical tests for these, or you could bootstrap. One problem, though, is your sample sizes: they are small and varying. So, as is often the case, you have to look at the effect size, not just the p-value. For comparing three means, you can use a linear model (ANOVA/regression), provided the assumptions are met. For comparing quantiles (including the median), you could use quantile regression, which makes fewer assumptions than a linear model. Or you could just list them and say "Look!"
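To make the bootstrap suggestion concrete, here is a minimal sketch of a percentile bootstrap for the difference in medians between two conditions. It is only an illustration: the question mentions R, but this uses Python's standard library, and the function name `bootstrap_median_diff`, the expression values and the number of resamples are all made up for the example. With n = 3 in one group the interval will be coarse, which is exactly why the effect size matters more than any single p-value.

```python
import random
import statistics

def bootstrap_median_diff(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for median(a) - median(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its own size,
        # so the uneven sample sizes are respected.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.median(resampled_a) - statistics.median(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    observed = statistics.median(a) - statistics.median(b)
    return observed, lower, upper

# Hypothetical log2 expression values for one process (n = 3 and n = 10).
cond1 = [-1.2, -0.8, -1.5]
cond3 = [0.9, 1.4, 0.3, 1.1, 0.7, 1.8, 0.5, 1.0, 1.3, 0.6]

observed, lo, hi = bootstrap_median_diff(cond1, cond3)
print(f"median difference = {observed:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

The observed difference is the effect size on the log2 scale, and the interval gives a rough sense of how much it could move under resampling. Running this for each process and sorting by the observed difference would be one way to shortlist which beanplots to look at first.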

For 2), again, there are different measures of spread: the most common are probably the standard deviation and the interquartile range. Does either of these appeal?
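As a rough illustration of the two spread measures, the sketch below computes the sample standard deviation and the interquartile range per condition. It uses Python's standard library rather than R, and the group names, values and the helper `spread_summary` are invented for the example. The "inclusive" quantile method is just one of several conventions; with only three values any quartile estimate is crude.

```python
import statistics

def spread_summary(values):
    """Return (sample standard deviation, interquartile range) for one group."""
    sd = statistics.stdev(values)  # needs at least two values
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return sd, q3 - q1

# Hypothetical log2 expression values with uneven group sizes.
groups = {
    "Condition 1": [-1.2, -0.8, -1.5],
    "Condition 2": [0.2, -0.4, 0.9, 1.6, -1.1, 0.5, 0.0, 2.1],
    "Condition 3": [0.9, 1.4, 0.3, 1.1, 0.7],
}

for name, values in groups.items():
    sd, iqr = spread_summary(values)
    print(f"{name}: n = {len(values)}, SD = {sd:.2f}, IQR = {iqr:.2f}")
```

The interquartile range is less sensitive to a single extreme gene than the standard deviation, which may matter when a group holds only a handful of values.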

For 3), things are trickier. You could look at measures of skewness and kurtosis across the three groups, but 1) these don't fully capture "shape" and 2) they are less intuitive than measures of location or spread (in particular, kurtosis doesn't really match intuition for any particular aspect of shape). So you'd have to say what it is about the shape that you are interested in.
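If you do want numbers for shape, a minimal sketch of moment-based skewness and excess kurtosis is given below. It is in Python's standard library rather than R, and the helper name and data are hypothetical. These are the simple population-moment versions without small-sample bias corrections, and with groups as small as n = 3 they are very unstable, so treat them as descriptive at best.

```python
import statistics

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no bias correction)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skew, excess_kurtosis

# Hypothetical values: one symmetric group, one with a long right tail.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_tailed = [0.1, 0.2, 0.3, 0.4, 3.0]

for name, values in [("symmetric", symmetric), ("right-tailed", right_tailed)]:
    skew, kurt = skewness_and_excess_kurtosis(values)
    print(f"{name}: skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A positive skewness points to a longer right tail and a negative one to a longer left tail; the kurtosis figure is harder to read, in line with the caveat above.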

Visual inspection gives you more information. Looking at 400 beanplots is a lot, but so is looking at 400 quantile regressions or ANOVAs.
