I was having a discussion with a colleague today regarding corrections for multiple testing. We're planning on running a large number of tests (probably in the hundreds as a rough estimate), so I brought up multiple testing and how we should go about it. Specifically, I raised the fact that the various outcomes are related (not independent) and thus don't meet the assumptions of Bonferroni, Benjamini-Hochberg, or whatever method of adjustment. Their suggestion was that we don't have to worry about it, because the sample size of each test is going to be big enough (we're looking at a minimum of n=100, but frequently n=1000+). This is the first time I've ever heard of such an approach, so I was a bit surprised. From my research and trying to think it through myself, perhaps their logic was that as n gets larger, sampling error becomes less of an issue, and that this therefore counteracts any increase in false positives?
My question ultimately is: is my colleague correct in suggesting that large sample sizes are sufficient for handling multiple testing, and if so, is my reasoning about it being related to sampling error accurate?
Best Answer
Think of buying hundreds of fair dice. You do not know that they are fair, though, and hence test whether each has an expected value of 3.5 points by throwing each one many times (1000+). One of them must come up as "best", and if you do not account for multiple testing, almost certainly statistically significantly so.
Recall that the probability that a true null is rejected should (in practice, that may not be exactly true due to things like asymptotic approximations and finite-sample size distortions) not depend on sample size!
You might then conclude, wrongly (or at least not rightly, in that it is no better, but also no worse than the others), that this is the one you should bring to your next board game.
As to practical significance, this will indeed provide a clue, in that the "winning" die will likely have won with an average number of points only barely above 3.5 when you have tossed it often.
Here is an illustration:
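The original code is not reproduced here, so the exact figures quoted below came from a run we cannot replay. A minimal Python sketch of the described experiment (the number of dice, 300, is an assumption; the seed is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed, not the original one
n_dice, n_throws, alpha = 300, 1000, 0.05  # n_dice is an assumed parameter

# Simulate throws of genuinely fair six-sided dice: every true mean is 3.5.
throws = rng.integers(1, 7, size=(n_dice, n_throws))

# One-sample t-test per die: H0 is E[points] = 3.5.
t_stat, p_vals = stats.ttest_1samp(throws, popmean=3.5, axis=1)

print("significant at 0.05:       ", int((p_vals < alpha).sum()))
print("significant after Bonferroni:", int((p_vals < alpha / n_dice).sum()))

# Average of the "winning" die, i.e., the one with the highest sample mean.
print("best die's average:", throws.mean(axis=1).max())
```

With all nulls true, roughly 5% of the dice come out "significant" at the unadjusted level, while the Bonferroni threshold of 0.05/300 typically rejects none.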
So we see a few "significantly" outperforming dice at level 0.05, but none, in this simulation run, after Bonferroni correction. The "winning" one (last line of the code) however has an average of 3.63, which is, in practice, not too far away from the true expectation 3.5.
We can also run a little Monte Carlo exercise - i.e., repeat the above exercise many times so as to average out any "uncommon" samples that might arise from `set.seed(1)`. We can then also illustrate the effect of varying the number of throws. Result:
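A sketch of such a Monte Carlo exercise in Python (the replication count, the grid of throw counts, and the number of dice are all assumptions, chosen small enough to run quickly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed
n_dice, n_reps, alpha = 100, 50, 0.05  # assumed parameters

results = {}
for n_throws in (50, 500, 5000):  # assumed grid of sample sizes
    sig, best_gap = [], []
    for _ in range(n_reps):
        # Fair dice again, so every rejection is a false positive.
        throws = rng.integers(1, 7, size=(n_dice, n_throws))
        _, p = stats.ttest_1samp(throws, popmean=3.5, axis=1)
        sig.append((p < alpha).mean())              # statistical significance
        best_gap.append(throws.mean(axis=1).max() - 3.5)  # practical significance
    results[n_throws] = (np.mean(sig), np.mean(best_gap))
    print(n_throws, results[n_throws])
```

The first number per row (share of unadjusted rejections) hovers around 0.05 for every sample size, while the second (the "best" die's distance from 3.5) shrinks as the number of throws grows.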
Hence, as predicted, the proportion of statistically significant results is independent of the number of throws (all proportions of $p$-values less than 0.05 are close to 0.05), while the practical significance - i.e., the distance of the average number of points of the "best" one to 3.5 - decreases in the number of throws.