Biostatistics – Importance of Poisson and Negative Binomial Models in RNA-seq Data Analysis

bioinformatics, biostatistics

I am a biologist and use different packages like DESeq, … to normalize my data and find differentially expressed genes.
Recently I have started to learn probability and statistics and have studied distributions quite thoroughly. But I still have a problem: I do not really understand why we use these distributions to infer
expression levels for genes, to normalize, and to find differentially expressed genes.

Why do we need, e.g., a Poisson or negative binomial model to obtain an approximate expression level?
Or, in a package called mmseq: "Expression levels are inferred for each transcript using the mmseq program by modelling mappings of reads or read pairs (fragments) to sets of transcripts." Why modelling?
Why do we need to estimate expression levels when we can directly count the number of reads per gene?

Or why is it appropriate to model read counts as, e.g., a Poisson
process?

Is it only that knowing the distribution (e.g. the negative binomial, which explains the observed counts well once noise is taken into account) lets us apply the right properties, such as the mean and variance, to the data, or is there more to learn from the distributions?

Sorry if my question is basic, but I have been struggling with this for a long time.

Best Answer

  1. The data are counts: the number of reads aligned to a gene. They are not continuous and therefore cannot be modelled with, say, a normal distribution.
  2. The Poisson distribution is designed for modelling count data.
  3. However, the Poisson distribution assumes the first and second moments (mean and variance) are equal. This is not true for RNA-Seq data: across biological replicates the counts are overdispersed, with the variance exceeding the mean, and the dispersion is typically larger for lowly expressed genes than for highly expressed ones.
  4. To account for this extra variability, we use the negative binomial (NB) model, which is an extension of the Poisson. The NB model has an extra dispersion parameter to model the variance separately from the mean. It can be shown that as the dispersion goes to zero (i.e. the variance approaches the mean), the NB model reduces to the Poisson model; see the mean-variance comparison after this list.
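
To make point 4 concrete, here is the standard mean-variance comparison, written in one common parameterisation with a dispersion parameter $\alpha$ (the exact symbol and parameterisation vary between packages):

If $K \sim \text{Poisson}(\mu)$, then $\mathrm{E}[K] = \mu$ and $\mathrm{Var}[K] = \mu$.

If $K \sim \text{NB}(\mu, \alpha)$ in the mean/dispersion parameterisation, then
$$\mathrm{E}[K] = \mu, \qquad \mathrm{Var}[K] = \mu + \alpha\,\mu^{2},$$
so the extra term $\alpha\mu^{2}$ captures the overdispersion, and as $\alpha \to 0$ the variance collapses to $\mu$ and the NB distribution converges to the Poisson distribution.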

EDIT

To answer your comments:

  1. Normalization is usually necessary to account for the different sequencing depths between libraries. However, you don't need to do it yourself if you use DESeq2 or edgeR: they have their own normalization methods (median-of-ratios size factors in DESeq2; trimmed mean of M-values, TMM, or upper-quartile normalization in edgeR). A simplified sketch of the size-factor idea follows after this list.

  2. Those packages do the normalization for you, fit your data to an NB model, and estimate the dispersion (i.e. the parameter controlling the variance). Once they have a model, they can apply whatever statistical test is required (the Wald test, by default, in DESeq2) to check for differentially expressed genes. The results depend both on how strongly the genes are expressed and on how much variance they have. The second sketch below shows the general shape of this NB-fit-plus-Wald-test step.
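
As a rough illustration of point 1, here is a minimal sketch of a DESeq-style median-of-ratios size-factor calculation. It is not the packages' actual code (DESeq2 and edgeR are R packages with more careful implementations), and the count matrix below is made up purely for illustration.

```python
import numpy as np

# Toy count matrix: rows = genes, columns = samples (made-up numbers).
counts = np.array([
    [100, 200,  80],
    [ 50, 110,  40],
    [ 10,  25,   8],
    [500, 990, 410],
], dtype=float)

# Median-of-ratios size factors (simplified sketch):
# 1. Build a pseudo-reference: the geometric mean of each gene across samples.
# 2. A sample's size factor is the median, over genes with a non-zero
#    geometric mean, of that sample's counts divided by the pseudo-reference.
log_counts = np.log(counts)
log_geo_means = log_counts.mean(axis=1)        # per-gene log geometric mean
finite = np.isfinite(log_geo_means)            # drop genes with any zero count
log_ratios = log_counts[finite] - log_geo_means[finite, None]
size_factors = np.exp(np.median(log_ratios, axis=0))
print("size factors:", size_factors)

# Normalized counts: divide each sample's column by its size factor.
norm_counts = counts / size_factors
print(norm_counts)
```

The point of dividing by a median of ratios rather than by total library size is that a handful of very highly expressed genes cannot dominate the scaling.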
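
And as a rough illustration of point 2, here is a sketch of fitting an NB model for a single gene and reading off a Wald test on the condition effect. This uses statsmodels in Python only to show the general idea; it is not how DESeq2 or edgeR are implemented (they moderate dispersion estimates by sharing information across genes, which this sketch does not do). The counts, size factors, and the fixed dispersion value are all made up.

```python
import numpy as np
import statsmodels.api as sm

# Toy example for ONE gene: 3 control and 3 treated replicates (made-up counts).
counts = np.array([90, 110, 105, 180, 210, 195])
condition = np.array([0, 0, 0, 1, 1, 1])          # 0 = control, 1 = treated
size_factors = np.array([1.0, 1.1, 0.95, 1.05, 1.0, 0.9])  # from normalization

# Design matrix: intercept + condition effect.
X = sm.add_constant(condition)

# Fit a negative-binomial GLM with a log size-factor offset. The dispersion
# (alpha) is fixed at a made-up value here; in practice it must be estimated.
model = sm.GLM(
    counts, X,
    family=sm.families.NegativeBinomial(alpha=0.05),
    offset=np.log(size_factors),
)
result = model.fit()

# The per-coefficient z statistics reported by the fit are Wald tests; the
# p-value on the condition coefficient asks "is the log fold change zero?"
print(result.summary())
print("Wald p-value for condition effect:", result.pvalues[1])
```

The differential-expression packages repeat this kind of fit for every gene, which is why both the expression level (the fitted mean) and the dispersion feed into which genes come out as significant.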