Solved – Bootstrapping of RNA-Seq data: normal distribution

bioinformaticsbiostatisticsbootstrapr

I have an RNA_Seq dataset (FPKM values) obtained from the ENCODE Caltech datasets for K562 cells; I also have 6 different lists of genes and I want to show that the expression values of my panel of genes is not random. Therefore, I decided to use R and perform bootstrapping.

Bootstrapping, as far as I know, requires normally distributed data. My RNA-Seq data are not normally distributed and they are strongly skewed on the left.

I used BoxCox transformation:

box.cox.powers(RNA_Seq[,2])

to normalize them.

I obtained a negative value (-0.03701075 to be precise) and raised my FPKM values to that. I then used:

scale()

to obtain the z-scores and see if the z-scores of my genes are bigger/smaller than +/-1.96.
that's because I am interested to see if my panel of genes is more highly expressed -or it has a lower expression- than what you would expect by chance.

However, the values with the highest FPKM became those with the lowest z-score and vice versa since I elevated to a negative number. I understand why this happens mathematically; however, I don't really know how to handle the bootstrap now because the data show the exact opposite of what I would reasonably expect!

So, how would you guys handle this situation to test my genes of interest?
I looked up online but I could not find anything specific for RNA-Seq, or I didn't know how to apply it to RNA-Seq data.

I asked this question on Stack Overflow but it was marked as off-topic and I should have posted it here. So here we go 🙂

Best Answer

ENCODE Caltech dataset contains two replicates for K562 cell line. Is that the experiment you are using? And how did you select the gene lists that you currently have at hand? Further, do you want to test all the genes in the list as a "set", or do you want to have a single p-value for each gene?

It is very common to select the gene lists on the basis of some statistical test. For example, R add-on packages edgeR, DESeq and limma (from the Bioconductor project) offer suitable methods for rna-seq data, and especially for small sample sizes. There are also other ways to do this, such as, simple filtering based on FPKM values or their standard deviation. Using some statistical test to arrive to list of differentially expressed genes will also simultaneously give you a p-value for each gene, which might be what you are after. Now that you have a single condition only (just one cell line, K562), the statistical test is equivalent to testing whether the gene's expression is different from 1, or 0 if the FPKM values are also log-transformed.

Each gene's statistical significance can also be tested by using a permutation test, where usually the sample labels are shuffled a large number of times. This approach is not really applicable here, since there are only two samples. For more information on a possible implementation, see this page.

In addition, it is also possible to test the whole genelist as a set, and possibly compare it to all the other genes in the experiment. This can be accomplished with, e.g., the package globaltest (from the Bioconductor project) in R.

Related Solutions

Solved – Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip data set

You are close, with your use of dhyper and phyper, but I don't understand where 0:2 and -1:2 are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more. [I never remember this and so always have to double check.]

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."

Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400)) gives $\sim$0.050), and getting 101 or 102 or any larger number is even more interesting that 100, and the p-value is the probability, if the null hypothesis were true, of getting a result as interesting or more so than what was observed.

Here's a picture of the hypergeometric distribution in this case. You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail. enter image description here

Solved – TMM-normalization of RNA-seq data in R language using edgeR package

Well, your function doesn't entirely make sense as written, depending as it does on an undefined global variable ah.

Assuming that M is a matrix of counts, the edgeR User's Guide advises you to use:

dge <- DGEList(M)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE)

if your aim is to get normalized quantities for plotting etc.

The User's Guide advises you not to use equalizeLibSizes.

Best Answer

Related Solutions

Solved – Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip data set

Solved – TMM-normalization of RNA-seq data in R language using edgeR package

Related Question