Solved – Bootstrapping of RNA-Seq data: normal distribution

bioinformaticsbiostatisticsbootstrapr

I have an RNA_Seq dataset (FPKM values) obtained from the ENCODE Caltech datasets for K562 cells; I also have 6 different lists of genes and I want to show that the expression values of my panel of genes is not random. Therefore, I decided to use R and perform bootstrapping.

Bootstrapping, as far as I know, requires normally distributed data. My RNA-Seq data are not normally distributed and they are strongly skewed on the left.

I used BoxCox transformation:

box.cox.powers(RNA_Seq[,2])

to normalize them.

I obtained a negative value (-0.03701075 to be precise) and raised my FPKM values to that. I then used:

scale()

to obtain the z-scores and see if the z-scores of my genes are bigger/smaller than +/-1.96.
that's because I am interested to see if my panel of genes is more highly expressed -or it has a lower expression- than what you would expect by chance.

However, the values with the highest FPKM became those with the lowest z-score and vice versa since I elevated to a negative number. I understand why this happens mathematically; however, I don't really know how to handle the bootstrap now because the data show the exact opposite of what I would reasonably expect!

So, how would you guys handle this situation to test my genes of interest?
I looked up online but I could not find anything specific for RNA-Seq, or I didn't know how to apply it to RNA-Seq data.

I asked this question on Stack Overflow but it was marked as off-topic and I should have posted it here. So here we go 🙂

Best Answer

ENCODE Caltech dataset contains two replicates for K562 cell line. Is that the experiment you are using? And how did you select the gene lists that you currently have at hand? Further, do you want to test all the genes in the list as a "set", or do you want to have a single p-value for each gene?

It is very common to select the gene lists on the basis of some statistical test. For example, R add-on packages edgeR, DESeq and limma (from the Bioconductor project) offer suitable methods for rna-seq data, and especially for small sample sizes. There are also other ways to do this, such as, simple filtering based on FPKM values or their standard deviation. Using some statistical test to arrive to list of differentially expressed genes will also simultaneously give you a p-value for each gene, which might be what you are after. Now that you have a single condition only (just one cell line, K562), the statistical test is equivalent to testing whether the gene's expression is different from 1, or 0 if the FPKM values are also log-transformed.

Each gene's statistical significance can also be tested by using a permutation test, where usually the sample labels are shuffled a large number of times. This approach is not really applicable here, since there are only two samples. For more information on a possible implementation, see this page.

In addition, it is also possible to test the whole genelist as a set, and possibly compare it to all the other genes in the experiment. This can be accomplished with, e.g., the package globaltest (from the Bioconductor project) in R.

Related Question