Hypothesis Testing – What’s Wrong with Multiple Testing Correction Compared to Joint Tests?

bonferroni, hypothesis-testing, multiple-comparisons

I am wondering why it is said that multiple testing corrections are ''arbitrary'' and that they are based on an incoherent philosophy in which

the veracity of one statement depends on which other hypotheses are entertained

See e.g. the answers and comments to What's wrong with Bonferroni adjustments?, in particular the discussion between @FrankHarrell and @Bonferroni.

Let us (for simplicity and ease of exposition) assume that we have two independent normal populations with known standard deviations but unknown means. Say, just as an example, that these standard deviations are $\sigma_1=2$ and $\sigma_2=3$ respectively.

Joint test

Assume we want to test the hypothesis $H_0: \mu_1 = 2 \& \mu_2=2$ versus $H_1: \mu_1 \ne 2 | \mu_2 \ne 2$ at a significance level of $\alpha=0.05$ (the symbol $\&$ meaning 'and' while $|$ means 'or').

We also have a random outcome $x_1$ from the first population and $x_2$ from the second population.

If $H_0$ is true, then the first random variable $X_1 \sim N(\mu_1=2,\sigma_1=2)$ and the second one $X_2 \sim N(\mu_2=2,\sigma_2=3)$. As we assumed independence, the random variable $X^2 = \frac{(X_1-\mu_1)^2}{\sigma_1^2} + \frac{(X_2-\mu_2)^2}{\sigma_2^2}$ is $\chi^2$-distributed with $df=2$. We can use this $X^2$ as a test statistic, and we will accept $H_0$ if, for the observed outcomes $x_1$ and $x_2$, it holds that $\frac{(x_1-\mu_1)^2}{\sigma_1^2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} \le \chi^2_\alpha$, where $\chi^2_\alpha$ is the $(1-\alpha)$ quantile of the $\chi^2_2$ distribution. In other words, the acceptance region for this test is an ellipse centered at $(\mu_1, \mu_2)$, and we have a density mass of $1-\alpha$ ''on top'' of this ellipse.
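For concreteness, here is a minimal sketch of this joint test in Python; the observations `x1` and `x2` below are made up purely for illustration:

```python
# Minimal sketch of the joint chi-squared test described above.
# mu1, mu2, sigma1, sigma2 are the values assumed in the question;
# x1 and x2 are hypothetical observations, for illustration only.
from scipy import stats

mu1, mu2 = 2.0, 2.0
sigma1, sigma2 = 2.0, 3.0
alpha = 0.05

x1, x2 = 3.1, -0.4  # hypothetical observed outcomes

# Test statistic: sum of two squared standardized deviations, chi^2 with df = 2
x_sq = ((x1 - mu1) / sigma1) ** 2 + ((x2 - mu2) / sigma2) ** 2

# Critical value chi^2_alpha: the (1 - alpha) quantile of chi^2 with df = 2
crit = stats.chi2.ppf(1 - alpha, df=2)

print(f"X^2 = {x_sq:.4f}, critical value = {crit:.4f}")
print("accept H0" if x_sq <= crit else "reject H0")
```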

Multiple tests

With multiple testing we will do two independent tests and ''adjust'' the significance level. So we perform a first test of $H_0^{(1)}: \mu_1 = 2$ versus $H_1^{(1)}: \mu_1 \ne 2$ and a second test of $H_0^{(2)}: \mu_2 = 2$ versus $H_1^{(2)}: \mu_2 \ne 2$, but at an adjusted significance level $\alpha^{adj.}$ chosen such that $1-(1-\alpha^{adj.})^2=0.05$, i.e. $\alpha^{adj.}=1-\sqrt{0.95}=0.02532057$ (this is the Šidák adjustment).
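This number can be reproduced directly:

```python
# Reproducing the adjusted level: solve 1 - (1 - a)^2 = 0.05 for a
alpha = 0.05
alpha_adj = 1 - (1 - alpha) ** 0.5
print(alpha_adj)  # 0.02532057 (rounded)
```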

In this case we will accept $H_0^{(1)}$ and $H_0^{(2)}$ (and both together are equivalent to our ''original'' $H_0: \mu_1 = 2 \& \mu_2=2$) whenever $\left|\frac{x_1 - \mu_1}{\sigma_1}\right| \le z_{\alpha^{adj.}/2}$ and $\left|\frac{x_2 - \mu_2}{\sigma_2}\right| \le z_{\alpha^{adj.}/2}$, where $z_{\alpha^{adj.}/2}$ is the upper-$\alpha^{adj.}/2$ quantile of the standard normal distribution (both tests are two-sided).

So we conclude that, with multiple testing, the acceptance region for $(x_1,x_2)$ has become a rectangle centered at $(\mu_1,\mu_2)$, with a probability mass of $1-\alpha$ on top of it.
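A quick Monte Carlo sketch, under the parameter values assumed above, confirms that both acceptance regions carry a probability mass of roughly 0.95 under $H_0$:

```python
# Monte Carlo check: under H0, both acceptance regions (the ellipse of the
# joint test, the rectangle of the two adjusted z-tests) should carry a
# probability mass of about 0.95.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu1, mu2, sigma1, sigma2, alpha = 2.0, 2.0, 2.0, 3.0, 0.05
n = 1_000_000

z1 = (rng.normal(mu1, sigma1, n) - mu1) / sigma1
z2 = (rng.normal(mu2, sigma2, n) - mu2) / sigma2

# Ellipse: joint chi-squared test with df = 2
in_ellipse = z1**2 + z2**2 <= stats.chi2.ppf(1 - alpha, df=2)

# Rectangle: two two-sided z-tests at the adjusted level
alpha_adj = 1 - (1 - alpha) ** 0.5
z_crit = stats.norm.ppf(1 - alpha_adj / 2)
in_rect = (np.abs(z1) <= z_crit) & (np.abs(z2) <= z_crit)

print(in_ellipse.mean(), in_rect.mean())  # both approximately 0.95
```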

Conclusion

So we find that for a joint ($\chi^2$) test the geometrical shape of the acceptance region is an ellipse, while with multiple testing it is a rectangle. The density mass ''on top'' of the acceptance region is 0.95 in both cases.

Questions

So what, then, is the problem with multiple testing? If such a problem exists, then (see supra) shouldn't the same problem exist for joint tests as well? Surely the reason cannot be that we prefer ellipses over rectangles?

Best Answer

I think you are missing @FrankHarrell's point here (I do not currently have access to Perneger's paper discussed in the linked thread, so cannot comment on it).

The debate is not about math; it is about philosophy. Everything you wrote here is mathematically correct, and clearly the Bonferroni correction allows one to control the familywise type I error rate, as does your "joint test". The debate is not about the specifics of Bonferroni itself at all; it is about multiple testing adjustments in general.

Everybody knows the argument for multiple testing corrections, illustrated by the famous XKCD jelly beans comic:

[XKCD #882, "Significant": the jelly beans comic]

Here is a counter-argument: suppose I developed a really convincing theory predicting that specifically green jelly beans should cause acne; I ran an experiment to test it and got a nice and clear $p=0.003$; it so happened that some other PhD student in the same lab, for whatever reason, ran nineteen tests for all the other jelly bean colors, getting $p>0.05$ every time; and now our advisor wants to put all of that in one single paper. Then I would be totally against "adjusting" my p-value from $p=0.003$ to $p=0.003\cdot 20 = 0.06$.
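For concreteness, here is a minimal sketch of what that adjustment does, using the Bonferroni method from statsmodels; the $p=0.003$ comes from the story above, while the nineteen other p-values are made up for illustration:

```python
# Sketch of the adjustment the counter-argument objects to: Bonferroni
# scales each p-value by the number of tests (here 20 colors).
from statsmodels.stats.multitest import multipletests

pvals = [0.003] + [0.5] * 19  # green jelly beans + 19 hypothetical null results
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(p_adj[0])   # 0.06: the green result is no longer "significant"
print(reject[0])  # False
```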

Note that the experimental data in the argument and in the counter-argument might be exactly the same; only the interpretation differs. This is fine, but it illustrates that one should not be obliged to apply multiple testing corrections in all situations. It is ultimately a matter of judgment. Crucially, real-life scenarios are usually not as clear-cut as here and tend to fall somewhere in between these two. See also Frank's example in his answer.
