Biostatistics – Importance of Poisson and Negative Binomial Models in RNA-seq Data Analysis

bioinformatics, biostatistics

I am a biologist and use different packages like DESeq, … to normalize my data and find differentially expressed genes.
Recently I have started to learn probability and statistics and have studied distributions quite thoroughly. But I still have a problem: I do not really understand why we use these distributions to infer
expression levels for genes, to normalize, and to find differentially expressed genes.

Why do we need, e.g., a Poisson or negative binomial model to obtain an approximate expression level?
Or, in a package called mmseq: "Expression levels are inferred for each transcript using the mmseq program by modelling mappings of reads or read pairs (fragments) to sets of transcripts." Why modelling?
Why do we need to estimate expression levels when we can directly count the number of reads per gene?

Or why is it appropriate to model read counts as, e.g., a Poisson
process?

Is it only that knowing the distribution (e.g. the negative binomial, which explains the observed counts well once noise is taken into account) lets us apply the right properties, such as the mean and variance, to the data, or is there more to learn from the distributions?

Sorry if my question is basic, but I have been struggling with this for a long time.

Best Answer

  1. The data are counts: the number of reads aligned to a gene. They are not continuous and therefore cannot be modelled with, say, a normal distribution.
  2. The Poisson distribution is designed for modelling count data.
  3. However, the Poisson distribution assumes the first and second moments (mean and variance) are equal. This is not true for RNA-Seq data: across biological replicates the counts are overdispersed, with the variance exceeding the mean, and the dispersion is typically larger for lowly expressed genes than for highly expressed ones.
  4. To account for this extra variability, we use the negative binomial (NB) model, which is an extension of the Poisson. The NB model has an extra dispersion parameter to model the variance separately from the mean. It can be shown that as the dispersion goes to zero (i.e. the variance approaches the mean), the NB model reduces to the Poisson model; see the mean-variance comparison after this list.
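
To make point 4 concrete, here is the standard mean-variance comparison, written in one common parameterisation with a dispersion parameter $\alpha$ (the exact symbol and parameterisation vary between packages):

If $K \sim \text{Poisson}(\mu)$, then $\mathrm{E}[K] = \mu$ and $\mathrm{Var}[K] = \mu$.

If $K \sim \text{NB}(\mu, \alpha)$ in the mean/dispersion parameterisation, then
$$\mathrm{E}[K] = \mu, \qquad \mathrm{Var}[K] = \mu + \alpha\,\mu^{2},$$
so the extra term $\alpha\mu^{2}$ captures the overdispersion, and as $\alpha \to 0$ the variance collapses to $\mu$ and the NB distribution converges to the Poisson distribution.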

EDIT

To answer your comments:

  1. Normalization is usually necessary to account for the different sequencing depths between libraries. However, you don't need to do it yourself if you use DESeq2 or edgeR: they have their own normalization methods (median-of-ratios size factors in DESeq2; trimmed mean of M-values, TMM, or upper-quartile normalization in edgeR). A simplified sketch of the size-factor idea follows after this list.

  2. Those packages do the normalization for you, fit your data to an NB model, and estimate the dispersion (i.e. the parameter controlling the variance). Once they have a model, they can apply whatever statistical test is required (the Wald test, by default, in DESeq2) to check for differentially expressed genes. The results depend both on how strongly the genes are expressed and on how much variance they have. The second sketch below shows the general shape of this NB-fit-plus-Wald-test step.
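
As a rough illustration of point 1, here is a minimal sketch of a DESeq-style median-of-ratios size-factor calculation. It is not the packages' actual code (DESeq2 and edgeR are R packages with more careful implementations), and the count matrix below is made up purely for illustration.

```python
import numpy as np

# Toy count matrix: rows = genes, columns = samples (made-up numbers).
counts = np.array([
    [100, 200,  80],
    [ 50, 110,  40],
    [ 10,  25,   8],
    [500, 990, 410],
], dtype=float)

# Median-of-ratios size factors (simplified sketch):
# 1. Build a pseudo-reference: the geometric mean of each gene across samples.
# 2. A sample's size factor is the median, over genes with a non-zero
#    geometric mean, of that sample's counts divided by the pseudo-reference.
log_counts = np.log(counts)
log_geo_means = log_counts.mean(axis=1)        # per-gene log geometric mean
finite = np.isfinite(log_geo_means)            # drop genes with any zero count
log_ratios = log_counts[finite] - log_geo_means[finite, None]
size_factors = np.exp(np.median(log_ratios, axis=0))
print("size factors:", size_factors)

# Normalized counts: divide each sample's column by its size factor.
norm_counts = counts / size_factors
print(norm_counts)
```

The point of dividing by a median of ratios rather than by total library size is that a handful of very highly expressed genes cannot dominate the scaling.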
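
And as a rough illustration of point 2, here is a sketch of fitting an NB model for a single gene and reading off a Wald test on the condition effect. This uses statsmodels in Python only to show the general idea; it is not how DESeq2 or edgeR are implemented (they moderate dispersion estimates by sharing information across genes, which this sketch does not do). The counts, size factors, and the fixed dispersion value are all made up.

```python
import numpy as np
import statsmodels.api as sm

# Toy example for ONE gene: 3 control and 3 treated replicates (made-up counts).
counts = np.array([90, 110, 105, 180, 210, 195])
condition = np.array([0, 0, 0, 1, 1, 1])          # 0 = control, 1 = treated
size_factors = np.array([1.0, 1.1, 0.95, 1.05, 1.0, 0.9])  # from normalization

# Design matrix: intercept + condition effect.
X = sm.add_constant(condition)

# Fit a negative-binomial GLM with a log size-factor offset. The dispersion
# (alpha) is fixed at a made-up value here; in practice it must be estimated.
model = sm.GLM(
    counts, X,
    family=sm.families.NegativeBinomial(alpha=0.05),
    offset=np.log(size_factors),
)
result = model.fit()

# The per-coefficient z statistics reported by the fit are Wald tests; the
# p-value on the condition coefficient asks "is the log fold change zero?"
print(result.summary())
print("Wald p-value for condition effect:", result.pvalues[1])
```

The differential-expression packages repeat this kind of fit for every gene, which is why both the expression level (the fitted mean) and the dispersion feed into which genes come out as significant.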