Solved – Why bother with Benjamini-Hochberg correction

Tags: false-discovery-rate, statistical-power

I was recently reading Efron's Simultaneous Inference paper (2008), in which he points out that FDR analysis is robust to splitting the tests into multiple mutually exclusive families, performing the analysis on these subsets separately, and then combining the results (see Sec. 5, 'Are separate analyses legitimate?').

This led me to wonder the following: if I have $p$ tests, why not run the FDR analysis on each hypothesis individually and then combine the results?

We know that procedures like Benjamini-Hochberg serve to control the FDR. However, they seem overly conservative, in that the empirical FDR is typically much lower than the nominal control rate.

Consider the following simple example:

ni = 100    # number of replicates (and rows per simulated matrix)
nj = 10000  # number of hypotheses per replicate

fdr = vapply(1:ni, function(i) {
  # nj columns of pure noise: every null hypothesis is true
  X = matrix(rnorm(ni * nj), ncol = nj)
  # two-sample t-test on the first vs. last 50 rows of each column
  pvalues = apply(X, MARGIN = 2, function(x) t.test(x[1:50], x[51:100])$p.value)
  qvalues = p.adjust(pvalues, method = 'BH')
  # proportion of rejections at 0.05, uncorrected vs. BH-adjusted
  pfdr = length(which(pvalues < 0.05)) / nj
  qfdr = length(which(qvalues < 0.05)) / nj
  return(c(pfdr, qfdr))
}, numeric(2))

The results follow:

On the left, (i) the empirical FDR over 100 replicates when we control FDR on each hypothesis separately; on the right, (ii) the usual group-wise control over the same 100 replicates.

So method (i) closely matches the control rate, while method (ii) (Benjamini-Hochberg) is extremely conservative in this sense.

My question is this: Wouldn't the most powerful controlling procedure be the one where the empirical FDR matches the control rate? Why would we choose anything less 'efficient' than this?


Efron, B. (2008), 'Simultaneous inference: when should hypothesis testing problems be combined?', The Annals of Applied Statistics 2(1):197–223.

Best Answer

This is a good question, but you have several concepts confused.

Firstly, to answer your broader question: yes, splitting the p-values into families and performing the correction on each family separately is a well-known and frequently used approach when you have prior information about the system you're studying. For more examples of this, and a proof for the case of independent tests, see Sun et al.$^1$ The idea behind this approach is to maintain the same level of FDR control while increasing the number of true positives that you find, so everyone wins!
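
To make this concrete, here is a rough sketch in R. It is my own toy construction rather than Sun et al.'s exact procedure: BH is applied within two prior-defined strata and compared with BH on the pooled p-values. The stratum sizes, the mean shift of 3, and the non-null fractions are arbitrary choices for illustration.

set.seed(1)
m1 = 5000; m2 = 5000            # stratum sizes
frac1 = 0.20; frac2 = 0.01      # stratum 1 is enriched for true effects
is_alt = c(runif(m1) < frac1, runif(m2) < frac2)
z = rnorm(m1 + m2, mean = ifelse(is_alt, 3, 0))
p = 2 * pnorm(-abs(z))
stratum = rep(1:2, c(m1, m2))

# BH on all p-values pooled, at level 0.05
rej_pooled = p.adjust(p, method = 'BH') < 0.05

# BH applied separately within each stratum (strata are contiguous here,
# so concatenating the per-stratum results preserves the original order)
rej_strat = unlist(lapply(split(p, stratum), function(ps)
  p.adjust(ps, method = 'BH') < 0.05))

summarise = function(rej) c(
  discoveries = sum(rej),
  true_pos    = sum(rej & is_alt),
  fdp         = if (sum(rej) > 0) sum(rej & !is_alt) / sum(rej) else 0
)
rbind(pooled = summarise(rej_pooled), stratified = summarise(rej_strat))

With the enriched stratum handled on its own, the stratified analysis typically recovers more of the true effects, while the false discovery proportion stays near the nominal 0.05 in both cases.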

The answer to your actual question (why not take the trivial case where every test is its own stratum) lies in the behavior of the estimators. As detailed in the Sun paper, to use a fixed-FDR framework (i.e. keep the same FDR and reject as many tests as possible, leading to more true positives being found), one must estimate $\pi_0$, the proportion of true null hypotheses, with an estimator $\hat{\pi}_0$; this has been a contentious issue in the field for many years now. You can read about this in the Methods section under 'Estimating $\pi_0$ and FDR' and in the Discussion section if you want. The bias of $\hat{\pi}_0$ does not increase with the number of strata, but its variance does; so as we increase the number of strata, we have a worse and worse estimator of $\pi_0$, which in turn means a worse and worse estimator of $\alpha^{(k)}$, the p-value threshold needed in stratum $k$ to maintain $\gamma$ overall FDR control. If I had to guess, and I have no proof of this other than intuition, I would say that in the trivial case where every test is its own stratum, the expected value of $\alpha^{(k)}$ for every $k$ would simply be $\gamma$, which negates the purpose of doing the exercise.
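
To illustrate the variance point, here is another sketch. I use a Storey-type estimator $\hat{\pi}_0(\lambda) = \#\{p_i > \lambda\} / (m(1-\lambda))$ with $\lambda = 0.5$; that choice of estimator and $\lambda$ is an assumption on my part for illustration, not the exact estimator used in the Sun paper.

set.seed(1)
m   = 10000
pi0 = 0.9                                  # true proportion of null hypotheses
is_null = runif(m) < pi0
z = rnorm(m, mean = ifelse(is_null, 0, 3))
p = 2 * pnorm(-abs(z))

# Storey-type estimate of pi0 (lambda = 0.5 assumed for illustration)
pi0_hat = function(p, lambda = 0.5) mean(p > lambda) / (1 - lambda)

pi0_hat(p)   # using all tests together: close to the true 0.9

# Split the tests at random into k strata of equal size and look at the
# spread of the per-stratum estimates: the mean stays roughly the same
# as k grows, but the standard deviation blows up
for (k in c(10, 100, 1000)) {
  strata = sample(rep_len(1:k, m))
  est = tapply(p, strata, pi0_hat)
  cat(sprintf('k = %4d strata: mean(pi0_hat) = %.3f, sd(pi0_hat) = %.3f\n',
              k, mean(est), sd(est)))
}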

I would like to know where you got the notion that FDR control is overly cautious; that is very much not my impression. Indeed, many people have found that the empirical FDR closely matches the control rate, except under certain special kinds of dependence (see a discussion of that here).

What you computed in your simulation was not the FDR but the false positive rate. You had no true positive associations, so any rejection was by definition a false discovery (FDR is the expected proportion of false positives among all positives). You set up randomly generated data and computed 1) the p-value and 2) the BH-corrected p-value (the $q$-value is actually a different concept, specific to John Storey's procedure$^2$). You found that about 5% of uncorrected p-values rejected the null even though every null hypothesis was true, which is exactly what setting $\alpha = 0.05$ means. You also found that 0% of the BH-corrected p-values rejected the null, which is entirely to be expected: there were no true positives to identify, and your results were all within the realm of chance. So really, you found that FDR control was doing exactly what it was supposed to do!
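
To see this in your own setting, re-run a version of your simulation with some genuine effects added, so that a false discovery proportion can actually be computed. The dimensions below are shrunk purely to keep the run time down, and the mean shift of 1 is an arbitrary choice:

set.seed(1)
n_rep  = 20
n_test = 1000
n_alt  = 100    # hypotheses with a real mean shift

fdp = replicate(n_rep, {
  is_alt = c(rep(TRUE, n_alt), rep(FALSE, n_test - n_alt))
  pvalues = vapply(1:n_test, function(j) {
    shift = if (is_alt[j]) 1 else 0
    t.test(rnorm(50, mean = shift), rnorm(50))$p.value
  }, numeric(1))
  rej = p.adjust(pvalues, method = 'BH') < 0.05
  # false discovery proportion: false rejections over all rejections
  if (sum(rej) > 0) sum(rej & !is_alt) / sum(rej) else 0
})
mean(fdp)   # empirical FDR: at or a little below the nominal 0.05

Now the BH-adjusted threshold actually rejects things, and the proportion of those rejections that are false hovers around the nominal level rather than being 'overly conservative'.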

[1] Sun L, Craiu RV, Paterson AD and Bull SB (2006). Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genetic Epidemiology 30:519-530.

[2] Storey, JD (2003). 'The positive false discovery rate: a Bayesian interpretation and the q-value.' The Annals of Statistics 31(6):2013–2035. https://projecteuclid.org/euclid.aos/1074290335