Solved – Help in understanding how to apply correctly False Discovery Rate adjustment

false-discovery-ratehypothesis testingp-valuer

I am trying to understand how to correctly apply False Discovery Rate when comparing multiple hypothesis tests. Although I use here R code, my doubts are about the procedure, rather than the programming.

I built a toy model in R made of 10000 hypothesis (e.g. genes expression)
made over two 5 samples populations:

set.seed(620)
x = matrix(rnorm(10000*5),nrow=10000)
y = matrix(rnorm(10000*5),nrow=10000)

With these datasets x and y I know that all the null hypothesis are true.

I now evaluate the p-values:

p = sapply(1:10000, function(i) t.test(x[i,],y[i,])$p.val)

As expected the number of p-values below 0.05 (or any other number) is 453, i.e. about 5% false positives as expected.

Next I adjust the p-values using False Discovery Rate adjustment and estimate the q-values:

q = p.adjust(p, method = "fdr")

Now, if I understood correctly, selecting the hypothesis that have a q value of 0.05 one should get 5% of false discoveries (number of false positives divided by the number of discoveries).

The number of hypothesis with q < 0.05 is 0. I think that it might be because, since all the null hypothesis are true, no matter how I choose q the false discoveries will always be 100% among the discoveries (this is also how I explain to myself that most of the q-values are close to 1).

Next I substituted the last one hundred rows of y with numbers sampled by a normal distribution with mean 3, and estimate the p and the q-values:

y[9901:10000,] = rnorm(500, mean=3) 
p = sapply(1:10000, function(i) t.test(x[i,],y[i,])$p.val)
q = p.adjust(p, method = "fdr")

After these modifications, the number of p-values < 0.05 increases to 544 and 98 out of the 100 hypothesis that should be rejected are detected.

However the number of hypothesis having q-values < 0.05 is surprisingly low: only 9. They are all hypothesis that should be rejected, so it seems to me that
the false discovery rate has been kept to 0 rather than to 0.05.

If I accept, for example, the hypothesis with q-value = 0.5 I end up with accepting 95 hypothesis. Of these 95, 67 are true discoveries and 28 are false discoveries. Therefore the FDR is of 28/95 = 0.3 and not 0.5 as I would expect.

Is there something I have not understood properly?
Why are the result I get so different from the one I would expect theoretically?

Best Answer

Note that Benjamini-Hochberg controls the false discovery rate to be less than or equal to the specified level, which has been satisfied by all of your scenarios.

In terms of the procedure you have followed, there are no faults there. However, it is worth pointing out that for real gene expression data you may get better results using a specific method such as Storey, John D., and Robert Tibshirani. "Statistical significance for genomewide studies." Proceedings of the National Academy of Sciences 100, no. 16 (2003): 9440-9445. This particular method exploits the fact that in such studies it is virtually impossible that the number of significant genes is zero.

Related Question