Multiple Comparisons – Plain Language Meaning of Dependent and Independent Tests

false-discovery-rate · familywise-error · independence · intuition · multiple-comparisons

In both the family-wise error rate (FWER) and false discovery rate (FDR) literature, particular methods of controlling the FWER or FDR are said to be appropriate for dependent or independent tests. For example, in the 1979 paper "A Simple Sequentially Rejective Multiple Test Procedure", Holm wrote, contrasting his sequentially rejective procedure based on Šidák levels with his sequentially rejective Bonferroni procedure:

The same computational simplicity is obtained when the test statistics are independent.

In "Controlling the False Discovery Rate" by Benjamini and Hochberg (1995), the authors write:

Theorem 1. For independent test statistics and for any configuration of false null hypotheses, the above procedure controls the FDR at $q^{*}$.

Later, in 2001, Benjamini and Yekutieli write:

1.3. The problem. When trying to use the FDR approach in practice, dependent test statistics are encountered more often than independent ones, the multiple endpoints example of the above being a case in point.

Which particular meanings of dependent and independent are these authors using? I would welcome formal definitions of what makes tests dependent or independent of one another, provided they are accompanied by a plain-language explanation.

I can think of a few different possible meanings, but I don't quite grok which, if any, they might be:

  • "Dependent" means multivariate tests (i.e. many dependent variables with the same or similar predictors); independent means univariate tests (i.e. many independent variables, one dependent variable).

  • "Dependent" means tests based on paired/matched subjects (e.g. paired t test, repeated measures ANOVA, etc.); "independent" means a unpaired/independent samples study designs.

  • "Dependent" means that the probability that a test is rejected is correlated with the probability that another test is rejected, and "positive dependency" means that this correlation is positive; "independent" means rejection probabilities are uncorrelated.

References
Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

Best Answer

"Multiple comparisons" is the name attached to the general problem of making decisions based on the results of more than one test. The nature of the problem is made clear by the famous XKCD "Green jelly bean" cartoon in which investigators performed hypothesis tests of associations between consumption of jelly beans (of 20 different colors) and acne. One test reported a p-value less than $1/20$, leading to the conclusion that "green jelly beans cause acne." The joke is that p-values, by design, have a $1/20$ chance of being less than $1/20$, so intuitively we would expect to see a p-value that low among $20$ different tests.

What the cartoon does not say is whether the $20$ tests were based on separate datasets or one dataset.

With separate datasets, each of the $20$ results has a $1/20$ chance of being "significant" when its null hypothesis is true. Basic properties of probabilities (of independent events) then imply that the chance that all $20$ results are "insignificant" is $(1-0.05)^{20}\approx 0.36$. The remaining chance of $1-0.36 = 0.64$ is large enough to corroborate our intuition that a single "significant" result in this large group of results is no surprise; no cause can validly be assigned to such a result except the operation of chance.
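
To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not part of the quoted papers) that simulates $20$ independent tests with true null hypotheses and estimates the chance that at least one comes out "significant" at the $0.05$ level:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, alpha, n_sims = 20, 0.05, 100_000

# Under true null hypotheses, exact tests yield p-values uniform on (0, 1);
# separate datasets make the 20 p-values independent of one another.
p = rng.uniform(size=(n_sims, n_tests))
at_least_one = np.mean((p < alpha).any(axis=1))

print(f"simulated P(at least one 'significant'): {at_least_one:.3f}")
print(f"analytic  1 - (1 - 0.05)^20            : {1 - (1 - alpha) ** n_tests:.3f}")
```

Both numbers should come out near $0.64$, matching $1-(1-0.05)^{20}$.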

If the $20$ results were based on a common dataset, however, the preceding calculation would be erroneous: it assumes all $20$ outcomes were statistically independent. But why wouldn't they be? Analysis of Variance provides a standard example: when comparing two or more treatment groups against a control group, every comparison involves the same control results, so the comparisons are not independent. For instance, "significant" differences could arise from chance variation in the controls alone, and such variation would simultaneously shift the comparisons with every treatment group.

(ANOVA handles this problem by means of its overall F-test: one comparison "to rule them all." We do not trust any individual group-to-group comparison unless that overall F-test is significant first.)
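
To see the dependence induced by a shared control group, here is a small simulation sketch (my own, with made-up group names) under the global null hypothesis. The two t-statistics that reuse the same control sample turn out to be correlated (around $0.5$), even though all the data are pure noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group = 5_000, 20

t_a = np.empty(n_sims)
t_b = np.empty(n_sims)
for i in range(n_sims):
    # Global null: the control and both treatment groups are all N(0, 1) noise.
    control = rng.normal(size=n_per_group)
    treatment_a = rng.normal(size=n_per_group)
    treatment_b = rng.normal(size=n_per_group)
    # Both comparisons reuse the very same control sample.
    t_a[i] = stats.ttest_ind(treatment_a, control).statistic
    t_b[i] = stats.ttest_ind(treatment_b, control).statistic

# The two t-statistics are correlated (about 0.5) solely because they
# share the control data, so the corresponding tests are not independent.
print(np.corrcoef(t_a, t_b)[0, 1])
```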

We can abstract the essence of this situation with the following framework. Multiple comparisons concerns making a decision from the p-values $(p_1, p_2, \ldots, p_n)$ of $n$ distinct tests. Those p-values are random variables. Assuming the corresponding null hypotheses are true (and logically consistent with one another), each should have a uniform distribution on $(0,1)$. When we know their joint distribution, we can construct reasonable ways to combine all $n$ of them into a single decision. Otherwise, the best we can usually do is rely on approximate bounds (which is the basis of the Bonferroni correction, for instance).
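
A standard calculation (added here only for illustration) makes the contrast explicit. If the $n$ null hypotheses are true and the p-values are independent, the chance that at least one falls below a per-test threshold $\alpha/n$ can be computed exactly; without independence, we must fall back on the union bound:

$$\Pr\!\left[\min_i p_i \le \tfrac{\alpha}{n}\right] \;=\; 1-\left(1-\tfrac{\alpha}{n}\right)^{n} \;\le\; n\cdot\tfrac{\alpha}{n} \;=\; \alpha.$$

The equality is what independence buys us; the inequality holds for any joint distribution, which is why the Bonferroni correction is only a (conservative) bound.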

Joint distributions of independent random variables are easy to compute. The literature therefore distinguishes between this situation and the case of non-independence.

Accordingly, the correct meaning of "independent" in the quotations is the usual statistical one: the test statistics (and hence the p-values) are independent random variables.


Note that an assumption was needed to arrive at this conclusion: namely, that all $n$ of the null hypotheses are logically consistent. As an example of what is being avoided, consider conducting two tests with a batch of univariate data $(x_1, \ldots, x_m)$ assumed to be a random sample from a Normal distribution of unknown mean $\mu$. The first is a t-test of $\mu=0$, with p-value $p_1$, and the second is a t-test of $\mu=1$, with p-value $p_2$. Since the two null hypotheses cannot both hold simultaneously, it is problematic to talk about "the null distribution" of $(p_1, p_2)$: no such thing exists. Thus the very concept of statistical independence sometimes cannot even apply.
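
For concreteness, here is a tiny Python sketch (my own illustration; the sample and its parameters are made up) of the two-tests-on-one-sample situation just described:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=0.5, scale=1.0, size=30)  # one batch of univariate data

# Two one-sample t-tests on the *same* data, with mutually exclusive nulls.
p1 = stats.ttest_1samp(x, popmean=0.0).pvalue  # tests H0: mu = 0
p2 = stats.ttest_1samp(x, popmean=1.0).pvalue  # tests H0: mu = 1

# p1 and p2 are functions of the same sample, and at most one of the two
# null hypotheses can be true, so there is no joint "null distribution"
# of (p1, p2) under which both are simultaneously uniform.
print(p1, p2)
```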