Benjamini and Hochberg (1995) introduced the false discovery rate. Benjamini and Yekutieli (2001) proved that their procedure controls the FDR under certain forms of (positive) dependence. Dependence can arise as follows. Consider the continuous variable used in a t-test and another variable correlated with it; for example, testing whether BMI differs between two groups and whether waist circumference differs between the same two groups. Because these variables are correlated, the resulting p-values will also be correlated. Yekutieli and Benjamini (1999) developed another FDR-controlling procedure, which can be used under general dependence by resampling the null distribution. Because the comparison is made against the null permutation distribution, the method becomes more conservative as the total number of true positives increases. It turns out that the BH (1995) procedure is also conservative as the number of true positives increases. To improve on this, Benjamini and Hochberg (2000) introduced the adaptive FDR procedure. This requires estimating a parameter, the null proportion, which is also used in Storey's pFDR estimator. Storey gives comparisons, argues that his method is more powerful, and emphasizes the conservative nature of the 1995 procedure. Storey also has results and simulations under dependence.
All of the above procedures are valid under independence. The question is what kinds of departure from independence each of them can handle.
My current thinking is that if you don't expect too many true positives, the Yekutieli and Benjamini (1999) resampling procedure is attractive because it incorporates distributional features and dependence; however, I'm unaware of an implementation. Storey's method was designed for situations with many true positives and some dependence. BH (1995) offers an alternative to the family-wise error rate, and it remains conservative.
Benjamini, Y. and Hochberg, Y. (2000). On the Adaptive Control of the False Discovery Rate in Multiple Testing with Independent Statistics. Journal of Educational and Behavioral Statistics.
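For concreteness, here is a minimal sketch in Python/NumPy of the two step-up adjustments discussed above; the p-values at the end are made up for illustration. Note that the dependence-robust variant shown is the Benjamini-Yekutieli (2001) step-up correction, not the resampling procedure of Yekutieli and Benjamini (1999).

```python
import numpy as np

def fdr_adjust(pvals, general_dependence=False):
    """Step-up FDR adjustment: BH (1995), or the BY (2001) variant for
    general dependence when general_dependence=True."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)      # p_(i) * m / i
    if general_dependence:
        scaled *= np.sum(1.0 / np.arange(1, m + 1))  # BY penalty: sum_{i=1}^m 1/i
    # enforce monotonicity by taking running minima from the largest p-value down
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty_like(adjusted)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]  # hypothetical
print(fdr_adjust(pvals))                            # BH-adjusted p-values
print(fdr_adjust(pvals, general_dependence=True))   # BY-adjusted, more conservative
```

Reject every hypothesis whose adjusted p-value falls below the chosen FDR level (for example, 0.05).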
"Multiple comparisons" is the name attached to the general problem of making decisions based on the results of more than one test. The nature of the problem is made clear by the famous XKCD "Green jelly bean" cartoon in which investigators performed hypothesis tests of associations between consumption of jelly beans (of 20 different colors) and acne. One test reported a p-value less than $1/20$, leading to the conclusion that "green jelly beans cause acne." The joke is that p-values, by design, have a $1/20$ chance of being less than $1/20$, so intuitively we would expect to see a p-value that low among $20$ different tests.
What the cartoon does not say is whether the $20$ tests were based on separate datasets or one dataset.
With separate datasets, each of the $20$ results has a $1/20$ chance of being "significant." Basic properties of probabilities (of independent events) then imply that the chance all $20$ results are "insignificant" is $(1-0.05)^{20}\approx 0.36$. The remaining chance of $1-0.36 = 0.64$ is large enough to corroborate our intuition that a single "significant" result in this large group of results is no surprise; no cause can validly be assigned to such a result except the operation of chance.
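Spelling this arithmetic out in code (assuming only that the $20$ tests are independent):

```python
p_all_insignificant = (1 - 0.05) ** 20     # every test clears the 0.05 bar
p_at_least_one = 1 - p_all_insignificant   # at least one "significant" result
print(round(p_all_insignificant, 2), round(p_at_least_one, 2))  # 0.36, 0.64
```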
If the $20$ results were based on a common dataset, however, the preceding calculation would be erroneous: it assumes all $20$ outcomes were statistically independent. But why wouldn't they be? Analysis of Variance provides a standard example: when comparing two or more treatment groups against a control group, each comparison involves the same control results, so the comparisons are not independent. For instance, "significant" differences could arise purely from chance variation in the controls, and such variation would simultaneously shift the comparisons with every treatment group.
(ANOVA handles this problem by means of its overall F-test. It is sort of a comparison "to rule them all": we do not trust any group-to-group comparison unless this F-test is significant first.)
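Here is a small simulation sketch of that shared-control dependence (the group size of 20, the 5,000 replicates, and the use of t-tests are arbitrary choices for illustration). All three groups are drawn from the same normal distribution, yet the two p-values from comparisons against the common control come out positively correlated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 20, 5000
p1 = np.empty(reps)
p2 = np.empty(reps)

for r in range(reps):
    control = rng.normal(size=n)   # the shared control group
    treat_a = rng.normal(size=n)   # both "treatments" drawn from the same null
    treat_b = rng.normal(size=n)
    p1[r] = stats.ttest_ind(treat_a, control).pvalue
    p2[r] = stats.ttest_ind(treat_b, control).pvalue

# each p-value is roughly uniform on its own, but the pair is correlated
print(np.corrcoef(p1, p2)[0, 1])   # clearly positive, not near zero
```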
We can abstract the essence of this situation with the following framework. Multiple comparisons concerns making a decision from the p-values $(p_1, p_2, \ldots, p_n)$ of $n$ distinct tests. Those p-values are random variables. Assuming all the corresponding null hypotheses are logically consistent, each p-value should have a uniform distribution under its null. When we know their joint distribution, we can construct reasonable ways to combine all $n$ of them into a single decision. Otherwise, the best we can usually do is rely on approximate bounds (which is the basis of the Bonferroni correction, for instance).
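Indeed, the Bonferroni correction rests on nothing more than the union bound, which requires no independence at all: $$\Pr\left(\bigcup_{i=1}^{n}\left\{p_i \le \frac{\alpha}{n}\right\}\right) \;\le\; \sum_{i=1}^{n}\Pr\left(p_i \le \frac{\alpha}{n}\right) \;\le\; n\cdot\frac{\alpha}{n} \;=\; \alpha,$$ where the final step uses only the fact that each p-value satisfies $\Pr(p_i \le t)\le t$ under its null hypothesis.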
Joint distributions of independent random variables are easy to compute. The literature therefore distinguishes between this situation and the case of non-independence.
Accordingly, the correct meaning of "independent" in the quotations is in the usual statistical sense of independent random variables.
Note that an assumption was needed to arrive at this conclusion: namely, that all $n$ of the null hypotheses are logically consistent. As an example of what is being avoided, consider conducting two tests with a batch of univariate data $(x_1, \ldots, x_m)$ assumed to be a random sample from a Normal distribution of unknown mean $\mu$. The first is a t-test of $\mu=0$, with p-value $p_1$, and the second is a t-test of $\mu=1$, with p-value $p_2$. Since both cannot logically hold simultaneously, it would be problematic to talk about "the null distribution" of $(p_1, p_2)$. In this case there can be no such thing at all! Thus the very concept of statistical independence sometimes cannot even apply.
Best Answer
This would obviously be an absolute nightmare to do in practice, but suppose it could be done: we appoint a Statistical Sultan and everyone running a hypothesis test reports their raw $p$-values to this despot. He performs some kind of global (literally) multiple comparisons correction and replies with the corrected versions.
Would this usher in a golden age of science and reason? No, probably not.
Let's start by considering one pair of hypotheses, as in a $t$-test. We measure some property of two groups and want to distinguish between two hypotheses about that property: $$\begin{align} H_0:& \textrm{ The groups have the same mean.} \\ H_A:& \textrm{ The groups have different means.} \end{align}$$ In a finite sample, the means are unlikely to be exactly equal even if $H_0$ really is true: measurement error and other sources of variability can push individual values around. However, the $H_0$ hypothesis is in some sense "boring", and researchers are typically concerned with avoiding a "false positive" situation wherein they claim to have found a difference between the groups where none really exists. Therefore, we only call results "significant" if they seem unlikely under the null hypothesis, and, by convention, that unlikeliness threshold is set at 5%.
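As a concrete, simulated example of such a single test (the group sizes, the mean shift of 0.5, and the variable names are all arbitrary choices), a sketch using SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)  # simulated measurements, group A
group_b = rng.normal(loc=0.5, scale=1.0, size=30)  # group B has a true mean shift

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
# by the 5% convention, call the difference "significant" only if p_value < 0.05
print("significant" if p_value < 0.05 else "not significant")
```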
This applies to a single test. Now suppose you decide to run multiple tests and are willing to accept a 5% chance of mistakenly rejecting $H_0$ in each one. With enough tests, you are therefore almost certainly going to start making errors, and lots of them.
The various multiple-comparisons correction approaches are intended to help you get back to a nominal error rate that you have already chosen to tolerate for individual tests. They do so in slightly different ways. Methods that control the Family-Wise Error Rate (FWER), like the Bonferroni, Sidak, and Holm procedures, say "You wanted a 5% chance of making an error on a single test, so we'll ensure that there's no more than a 5% chance of making any errors across all of your tests." Methods that control the False Discovery Rate instead say "You are apparently okay with being wrong up to 5% of the time with a single test, so we'll ensure that no more than 5% of your 'calls' are wrong when doing multiple tests." (See the difference?)
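As a sketch of how the two flavors look side by side, assuming the statsmodels package is available (the raw p-values below are made up):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.011, 0.02, 0.031, 0.046, 0.18, 0.43, 0.9]  # hypothetical raw p-values

# FWER control (Holm): keep the chance of *any* false rejection below 5%
reject_holm, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")

# FDR control (Benjamini-Hochberg): keep the expected share of false "calls" below 5%
reject_bh, p_bh, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print(reject_holm, p_holm)
print(reject_bh, p_bh)   # typically rejects more hypotheses than Holm
```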
Now, suppose you attempted to control the family-wise error rate of all hypothesis tests ever run. You are essentially saying that you want a <5% chance of falsely rejecting any null hypothesis, ever. This sets up an impossibly stringent threshold, and inference would become effectively useless; but there's an even more pressing issue: your global correction means you are testing absolutely nonsensical "compound hypotheses" like
$$\begin{align} H_1: &\textrm{ Drug XYZ changes T-cell count } \wedge \\ &\textrm{ Grapes grow better in some fields } \wedge \\ &\ldots \wedge \ldots \wedge \ldots \wedge \ldots \wedge \\ &\textrm{ Men and women eat different amounts of ice cream} \end{align}$$
With False Discovery Rate corrections, the numerical issue isn't quite so severe, but it is still a mess philosophically. Instead, it makes sense to define a "family" of related tests, like a list of candidate genes during a genomics study, or a set of time-frequency bins during a spectral analysis. Tailoring your family to a specific question lets you actually interpret your Type I error bound in a direct way. For example, you could look at a FWER-corrected set of p-values from your own genomic data and say "There's a <5% chance that any of these genes are false positives." This is a lot better than a nebulous guarantee that covers inferences done by people you don't care about on topics you don't care about.
The flip side of this is that the appropriate choice of "family" is debatable and a bit subjective (are all genes one family, or can I just consider the kinases?), but it should be informed by your problem, and I don't believe anyone has seriously advocated defining families nearly so extensively.
How about Bayes?
Bayesian analysis offers a coherent alternative to this problem, if you're willing to move a bit away from the Frequentist Type I/Type II error framework. We start with some non-committal prior over...well...everything. Every time we learn something, that information is combined with the prior to generate a posterior distribution, which in turn becomes the prior for the next time we learn something. This gives you a coherent update rule, and you could compare different hypotheses about specific things by calculating the Bayes factor between them. You could presumably factor out large chunks of the model, so this wouldn't even be particularly onerous.
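As a toy sketch of that last point, here is a Bayes factor for a single binomial question; the counts and the flat Beta(1, 1) prior are arbitrary choices for illustration:

```python
from math import comb, exp
from scipy.special import betaln
from scipy.stats import binom

k, n = 61, 100                       # hypothetical data: 61 successes in 100 trials

# H0: success probability is exactly 0.5
marginal_h0 = binom.pmf(k, n, 0.5)

# H1: success probability ~ Beta(1, 1); Beta-Binomial marginal likelihood
marginal_h1 = comb(n, k) * exp(betaln(k + 1, n - k + 1) - betaln(1, 1))

bayes_factor_10 = marginal_h1 / marginal_h0
print(bayes_factor_10)               # >1 favors H1, <1 favors H0
# posterior odds = Bayes factor * prior odds; with 1:1 prior odds they coincide
```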
There is a persistent...meme that Bayesian methods don't require multiple comparisons corrections. Unfortunately, the posterior odds are just another test statistic for frequentists (i.e., people who care about Type I/II errors). They don't have any special properties that control these types of error (why would they?), so you're back in intractable territory, but perhaps on slightly more principled ground.
The Bayesian counter-argument is that we should focus on what we can know now and thus these error rates aren't as important.
On Reproducibility
You seem to be suggesting that improper multiple-comparisons correction is the reason behind a lot of incorrect/unreproducible results. My sense is that other factors are more likely to be an issue. An obvious one is that pressure to publish leads people to avoid experiments that really stress their hypothesis (i.e., bad experimental design).
For example, in one experiment (part of Amgen's (ir)reproducibility initiative), it turned out that the mice had mutations in genes other than the gene of interest. Andrew Gelman also likes to talk about the Garden of Forking Paths, wherein researchers choose a (reasonable) analysis plan based on the data, but might have done other analyses if the data looked different. This inflates $p$-values in a similar way to multiple comparisons, but is much harder to correct for afterward. Blatantly incorrect analysis may also play a role, but my feeling (and hope) is that that is gradually improving.