Solved – Compare distributions with uneven and small sample sizes

Tags: distributions, r, sample-size

I will be performing my analysis in R but any general mathematical-based answers will be appreciated.

I realize similar questions have been asked before, but I am new to statistics, so please excuse my ignorance. I do not know enough to be sure that the approaches used in other threads I viewed are valid for my case, so I would like to check.

Here is the context, simplified for those unfamiliar with this kind of data: I am looking at gene expression data. Essentially, I have a number of categories that each have a set of genes in them. Each gene has an associated expression value (amount present in my sample, log2 transformed). I also have three different test conditions (these conditions are completely independent of one another!). What this amounts to is a distribution. As an example:

Biological Process A:

Contains X genes and their expression levels for Condition 1
Contains Y genes and their expression levels for Condition 2
Contains Z genes and their expression levels for Condition 3

[Figure: example showing my data setup]

I would like to compare these three distributions. By compare I mean that I want to see if there are any large shifts in either the shape or spread of the distribution. EDIT: as Peter has pointed out, what I am really interested in is 1) location, 2) spread and 3) shape of the three distributions under comparison. For example, if Condition 1 moved such that it was almost entirely below the x-axis and Condition 2 was entirely above it, that would be a very interesting shift. Additionally, if two conditions went above the axis and one went below, this would also be an interesting shift. Essentially, I need a way to compare the distributions for the biggest changes in mean density above or below the x-axis.

I have more than 400 "Biological Processes". The problem is that for some of them the sample sizes vary considerably (gene counts of n = 3, 25 and 10 for the three conditions respectively would be an extreme example in my case), so I do not think simply comparing the means is a fair approach. Visual comparison will take a very long time with 400 beanplots to look at, and I would also like some statistics to back up any conclusions I make. I want to be able to say something like "Treating tissue with Condition 1 had the effect of increasing expression for most genes known to be involved in Biological Process A".

Additionally, if this approach is invalid or there is a better way to look at this data I would be open to suggestions. Thank you.

Best Answer

You ask

I would like to compare these three distributions. By compare I mean that I want to see if there are any large shifts in either the shape or spread of the distribution

and your question implies you are also interested in shift of location.

Then there are really three things you want to compare: 1) location, 2) spread and 3) shape.

For 1), you can compare means, medians or other quantiles. There are statistical tests for these, or you could bootstrap. One problem, though, is your sample sizes: they are small and varying. So, as often, you have to look at effect size, not just the p-value. For comparing three means, you can use a linear model (ANOVA/regression), provided the assumptions are met. For comparing quantiles (including the median), you could use quantile regression, which makes fewer assumptions than a linear model. Or you could just list them and say "Look!"
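As a rough illustration of the location comparison, here is a minimal base R sketch. The data are simulated and hypothetical (the names `expr`, `condition` and `log2_expr` are my own, not from the question), the group sizes mimic the 3/25/10 case, and eta-squared is just one possible effect size. With only three genes in a group, the bootstrap interval should be read as a rough indication at best.

```r
# Hypothetical toy data: log2 expression for one biological process,
# with uneven group sizes (3, 25, 10) like the extreme case in the question.
set.seed(1)
sizes <- c(C1 = 3, C2 = 25, C3 = 10)
expr <- data.frame(
  condition = factor(rep(names(sizes), times = sizes)),
  log2_expr = rnorm(sum(sizes), mean = rep(c(-1, 0.5, 0), times = sizes))
)

# Location summaries per condition.
group_means   <- tapply(expr$log2_expr, expr$condition, mean)
group_medians <- tapply(expr$log2_expr, expr$condition, median)

# One-way ANOVA, plus eta-squared as a simple effect size:
# the share of total variation explained by condition.
fit    <- aov(log2_expr ~ condition, data = expr)
ss     <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)

# Percentile bootstrap interval for a difference in medians (y minus x),
# resampling within each group so the group sizes stay fixed.
boot_median_diff <- function(x, y, n_boot = 2000) {
  diffs <- replicate(n_boot, {
    xb <- x[sample.int(length(x), replace = TRUE)]
    yb <- y[sample.int(length(y), replace = TRUE)]
    median(yb) - median(xb)
  })
  quantile(diffs, c(0.025, 0.975))
}

c1 <- expr$log2_expr[expr$condition == "C1"]
c2 <- expr$log2_expr[expr$condition == "C2"]
ci_c2_minus_c1 <- boot_median_diff(c1, c2)

print(round(group_medians, 2))
print(round(eta_sq, 2))
print(round(ci_c2_minus_c1, 2))
```

Reporting the medians and the interval alongside any p-value keeps the focus on how large the shift is, which matters more than significance when n is this small.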

For 2), again, there are different measures of spread. The most common are probably the standard deviation and the interquartile range. Does either of these appeal?
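Both measures of spread are one-liners in base R. The sketch below uses simulated, hypothetical data (the column names are my own) and simply tabulates the group size, standard deviation and interquartile range for each condition.

```r
# Hypothetical toy data with the same location but different spread per condition.
set.seed(2)
sizes <- c(C1 = 3, C2 = 25, C3 = 10)
expr <- data.frame(
  condition = factor(rep(names(sizes), times = sizes)),
  log2_expr = rnorm(sum(sizes), sd = rep(c(0.5, 1, 2), times = sizes))
)

# Group size, standard deviation and interquartile range per condition.
spread <- data.frame(
  n   = as.vector(table(expr$condition)),
  sd  = as.vector(tapply(expr$log2_expr, expr$condition, sd)),
  iqr = as.vector(tapply(expr$log2_expr, expr$condition, IQR)),
  row.names = levels(expr$condition)
)

print(round(spread, 2))
```

Keeping `n` next to each estimate is a reminder that a standard deviation or IQR computed from three values is very unstable.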

For 3), things are trickier. You could look at measures of skewness and kurtosis across the three groups, but (a) these don't fully capture "shape" and (b) they are less intuitive than measures of location or spread (in particular, kurtosis doesn't really match intuition for any particular aspect of shape). So you'd have to say what it is about the shape that you are interested in.
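If you do want skewness and kurtosis, they can be computed directly from the sample moments without an extra package. This is a minimal sketch using the simple moment-based definitions (other definitions with small-sample corrections exist), on simulated, hypothetical data. With group sizes as small as 3, the values are extremely noisy.

```r
# Moment-based sample skewness: third central moment over sd^3.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based sample kurtosis (not excess kurtosis): a normal distribution gives about 3.
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Hypothetical toy data: one symmetric group and one right-skewed group.
set.seed(3)
expr <- data.frame(
  condition = factor(rep(c("C1", "C2"), times = c(25, 25))),
  log2_expr = c(rnorm(25), rexp(25))
)

shape <- data.frame(
  skewness = as.vector(tapply(expr$log2_expr, expr$condition, skewness)),
  kurtosis = as.vector(tapply(expr$log2_expr, expr$condition, kurtosis)),
  row.names = levels(expr$condition)
)

print(round(shape, 2))
```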

Visual inspection gives you more information. Looking at 400 beanplots is a lot, but so is looking at 400 quantile regressions or ANOVAs.
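One way to cut down the amount of looking is to compute a few summaries per process and rank the processes by the size of the shift, then inspect plots only for the top of the list. The sketch below is an assumption about how that might be organised, not a recommended analysis: the data are simulated, the column names (`process`, `condition`, `log2_expr`) and the helper `summarise_process` are hypothetical, and the ranking criterion (range of the condition medians) is just one choice. It also records the fraction of genes above zero, since the question asks about shifts above or below the axis.

```r
# Hypothetical long-format data: 5 processes, 3 conditions, uneven gene counts.
set.seed(42)
n_proc <- 5
sim <- do.call(rbind, lapply(seq_len(n_proc), function(i) {
  sizes <- sample(3:25, 3, replace = TRUE)
  data.frame(
    process   = paste0("process_", i),
    condition = rep(c("C1", "C2", "C3"), times = sizes),
    log2_expr = rnorm(sum(sizes), mean = rep(rnorm(3), times = sizes))
  )
}))

# One row of summaries per process.
summarise_process <- function(d) {
  med     <- tapply(d$log2_expr, d$condition, median)
  frac_up <- tapply(d$log2_expr > 0, d$condition, mean)
  data.frame(
    process       = d$process[1],
    n_min         = min(table(d$condition)),  # smallest group, as a caution flag
    median_range  = max(med) - min(med),      # largest shift in location
    frac_up_range = max(frac_up) - min(frac_up)
  )
}

per_process <- do.call(rbind, lapply(split(sim, sim$process), summarise_process))

# Largest location shifts first.
ranked <- per_process[order(-per_process$median_range), ]
print(ranked, row.names = FALSE)
```

Processes with a small `n_min` can top the ranking by chance alone, so it is worth checking that column before reading much into a large shift.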

Related Solutions

Solved – Compare two distributions of large sizes and unequal variances where one distribution is heavily skewed

I would not cling to a fancy statistical treatment of your results. If you see that the distributions are very different, that alone is probably an interesting finding. You can run a two-sample Kolmogorov-Smirnov test just to have a statistical analysis supporting your observation, which can be shown as histograms. The null hypothesis of this test is that both samples come from the same distribution. I wouldn't focus on its results, though, but simply list them as a note. The graph should be enough.

Even if the outcomes had the same mean, showing that their distributions are very different can be valuable in some cases. In your case the distributions are clearly separated, so there's no need to make your results "cooler" by an array of statistical tests.
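For reference, the two-sample Kolmogorov-Smirnov test is available in base R as `ks.test()`. The sketch below runs it on two simulated, hypothetical samples (one symmetric, one heavily skewed); the statistic D is the largest vertical distance between the two empirical distribution functions.

```r
# Hypothetical samples: one roughly symmetric, one heavily right-skewed.
set.seed(4)
x <- rnorm(200)
y <- rexp(200)

# Two-sample Kolmogorov-Smirnov test.
# Null hypothesis: both samples come from the same distribution.
ks <- ks.test(x, y)

print(ks$statistic)  # D: maximum distance between the two empirical CDFs
print(ks$p.value)

# Histograms to show alongside the test result.
op <- par(mfrow = c(1, 2))
hist(x, main = "Sample x", xlab = "value")
hist(y, main = "Sample y", xlab = "value")
par(op)
```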

R PCA – Need Help Interpreting Featureplot Legend

Seurat is a very specialized R package, so it's probably best to create an issue on GitHub to ask this question.

In the meantime, I'll show you how to figure out what data is shown in the plot. I don't know anything about cell biology, so it will be up to you to figure out what the data means.

library("Seurat")
library("tidyverse")

# I use a small dataset that comes with the Seurat package.
data("pbmc_small")

# Run UMAP on the first 5 PCs
pbmc_small <- RunUMAP(
  object = pbmc_small,
  dims = 1:5
)

# Generate patchwork of 4 ggplots
p <- FeaturePlot(
  object = pbmc_small,
  features = c("PPBP", "IGLL5"),
  reduction = "umap",
  blend = TRUE
)

# We can extract and look at the plots one by one
p1 <- p[[1]]
p2 <- p[[2]]
p3 <- p[[3]]
# The last plot is the color legend and is not interesting
p4 <- p[[4]]

p1

# The plot has the two UMAP dimensions, UMAP_1 and UMAP_2,
# on the x and y axis and colors the points according to PPBP.
head(p1$data)
#>                  UMAP_1    UMAP_2 ident PPBP
#> ATGCCAGAACGACT 4.692863  1.759652     0    0
#> CATGGCCTGTGCAT 5.494108  1.453728     0    0
#> GAACCTGATGAACC 2.188469 -5.069190     0    4
#> TGACTGGATTCTCA 4.183846  3.815155     0    0
#> AGTCAGACTGCACA 4.731087  1.388607     0    0
#> TCTGATACACGTGT 4.880636  1.954429     0    0

# What does `PPBP` mean? I have no idea but I'd guess
# it's a scale for gene expression:
# * No expression (PPBP = 0) in 67 cells.
# * Medium expression (PPBP = 4) in 3 cells.
# * High expression (PPBP = 9) in 10 cells.
p1$data %>%
  count(PPBP)
#>   PPBP  n
#> 1    0 67
#> 2    4  3
#> 3    9 10

Created on 2022-04-14 by the reprex package (v2.0.1)
