Hypothesis Testing – Necessity of Correcting Multiple Comparisons with Large Sample Sizes

hypothesis-testing, large-data, multiple-comparisons, sampling, statistical-significance

I was having a discussion with a colleague today about corrections for multiple testing. We're planning to run a large number of tests (probably in the hundreds, as a rough estimate), so I brought up multiple testing and how we should handle it, specifically the fact that the various outcomes are related (not independent) and therefore don't meet the assumptions of Bonferroni, Benjamini-Hochberg, or whatever method of adjustment. Their suggestion was that we don't have to worry about it, because the sample size of each test is going to be big enough (we're looking at a minimum of n=100 but frequently n=1000+). This is the first time I've heard of such an approach, so I was a bit surprised. From my own reading and from trying to think it through myself, perhaps their logic was that as n gets larger, sampling error becomes less of an issue, and this counteracts any increase in false positives?

My question, ultimately, is: is my colleague correct that large sample sizes are sufficient for handling multiple testing, and if so, is my reasoning about sampling error accurate?

Best Answer

Think of buying hundreds of fair dice. You do not know that they are fair, though, and hence test whether each one has an expected value of 3.5 points by throwing each of them many times (1000+). One of them must come up as the "best", and if you do not account for multiple testing, it will almost certainly appear statistically significantly better.

Recall that the probability of rejecting a true null hypothesis should not depend on the sample size (in practice this may not hold exactly, due to things like asymptotic approximations and finite-sample size distortions)!
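To see why "almost certainly" is no exaggeration, here is a quick back-of-the-envelope calculation (a sketch; it treats the 100 tests as independent, which holds for the simulated dice below but not necessarily for your related outcomes):

alpha <- 0.05
m     <- 100                 # number of dice, all with true nulls
1 - (1 - alpha)^m            # P(at least one false rejection), about 0.994

Note that the number of throws per die does not enter this calculation at all.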

You might then conclude, wrongly (or at least not rightly, in that it is no better, but also no worse than the others), that this is the one you should bring to your next board game.

As to practical significance, the results will indeed provide a clue: the "winning" die will likely have won with an average number of points only barely above 3.5 when the number of throws is large.

Here is an illustration:

set.seed(1)
dice <- 100
throws <- 1000

# One t-test per die: right-tailed, to look for "better" dice
# (assuming a game where many points are good; nothing hinges on this)
tests <- apply(replicate(dice, sample(1:6, throws, replace = TRUE)), 2,
               function(x) t.test(x, alternative = "greater", mu = 3.5))

# Sorted p-values of the 100 tests
plot(1:dice, sort(unlist(lapply(1:dice, function(i) tests[[i]]$p.value))))
abline(h = 0.05, col = "blue")       # significance threshold not
                                     # accounting for multiple testing
abline(h = 0.05/dice, col = "red")   # Bonferroni threshold

max(unlist(lapply(1:dice, function(i) tests[[i]]$estimate))) 
# the sample average of the "winner"

[Plot: sorted p-values of the 100 dice, with the unadjusted 0.05 threshold (blue) and the Bonferroni threshold (red)]

So we see a few dice that "significantly" outperform at level 0.05, but none, in this simulation run, after the Bonferroni correction. The "winning" die (last line of the code), however, has an average of only 3.63, which in practice is not far from the true expectation of 3.5.
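As a rough sanity check (a sketch that plugs in the theoretical standard deviation of a fair die, sqrt(35/12) ≈ 1.71, instead of the sample estimate), the winner's average of 3.63 is only about 2.4 standard errors above 3.5:

se <- sqrt(35/12) / sqrt(throws)   # standard error of the mean with 1000 throws
z  <- (3.63 - 3.5) / se            # roughly 2.4
pnorm(z, lower.tail = FALSE)       # one-sided p of about 0.008: below 0.05, but
                                   # well above the Bonferroni cutoff of 0.05/100

This is consistent with the winner clearing the unadjusted threshold but not the Bonferroni one.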

We can also run a little Monte Carlo exercise, i.e., repeat the above exercise many times, so as to average out any "uncommon" samples that might arise from set.seed(1). We can then also illustrate the effect of varying the number of throws.

# Monte Carlo, with several runs of the experiment:

reps <- 500

mc.func.throws <- function(throws){
  tests <- apply(replicate(dice, sample(1:6, throws, replace = TRUE)), 2,
                 function(x) t.test(x, alternative = "greater", mu = 3.5))
  # the sample average of the "winner"
  winning.average <- max(unlist(lapply(1:dice, function(i) tests[[i]]$estimate)))
  # share of p-values below 0.05, with no multiplicity adjustment
  significant.pvalues <- mean(unlist(lapply(1:dice,
                              function(i) tests[[i]]$p.value)) < 0.05)
  return(list(winning.average, significant.pvalues))
}

diff.throws <- function(throws){
  mc.study <- replicate(reps, mc.func.throws(throws))

  average.winning.average <- mean(unlist(mc.study[1,]))
  mean.significant.results <- mean(unlist(mc.study[2,]))
  return(list(average.winning.average, mean.significant.results))
}

throws <- c(10, 50, 100, 500, 1000, 10000)

mc.throws <- lapply(throws, diff.throws)

Result:

> unlist(lapply(mc.throws, `[[`, 1))
[1] 4.809200 4.108400 3.927120 3.692292 3.635224 3.542961

> unlist(lapply(mc.throws, `[[`, 2))
[1] 0.04992 0.05134 0.05012 0.04964 0.05006 0.05040

Hence, as predicted, the proportion of statistically significant results is independent of the number of throws (all proportions of $p$-values below 0.05 are close to 0.05), while the practical significance, i.e., the distance between the average number of points of the "best" die and 3.5, decreases as the number of throws grows.
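If you do decide to adjust, base R's p.adjust can be applied directly to the vector of p-values from the first simulation above (a minimal sketch; which adjustment method is appropriate under your dependence structure is a separate question):

pvals <- unlist(lapply(1:dice, function(i) tests[[i]]$p.value))

sum(pvals < 0.05)                                   # rejections without adjustment
sum(p.adjust(pvals, method = "bonferroni") < 0.05)  # rejections after Bonferroni
sum(p.adjust(pvals, method = "BH") < 0.05)          # rejections after Benjamini-Hochberg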
