Solved – Sample-size calculations for Benjamini-Hochberg, Westfall-Young, Holm-Bonferroni methods

bonferroni, sample-size

I'm currently setting up an experiment where we might want to do multiple comparisons (i.e. comparing several treatments with a control at the same time). It is pretty straightforward to calculate the needed sample-size using the Bonferroni correction. However, the Bonferroni method is pretty conservative, and I'm worried that we're wasting time or resources getting more samples than we actually need.

Are there ways to calculate the needed sample size for other correction methods, such as the Benjamini-Hochberg, Holm-Bonferroni, or Westfall-Young corrections?

Or, in your experience, are you likely to see any substantial decrease (more than 5%) in sample size using any of these other methods at all?

The test in question is a simple comparison of treatment effects on a categorical outcome variable, with expected mean at 50%.
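For reference, the Bonferroni baseline mentioned above can be sketched roughly as follows. This is only a minimal sketch using the usual normal-approximation formula for a two-sided two-proportion z-test, with alpha divided by the number of comparisons. The function name `n_per_arm`, the 55% treatment rate, the 80% power, and the comparison counts are all assumptions chosen for illustration; only the 50% baseline comes from the setup described here.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.8, n_comparisons=1):
    """Approximate per-arm sample size for a two-sided two-proportion z-test.

    Bonferroni is applied by testing each comparison at alpha / n_comparisons.
    Uses the unpooled normal approximation, so treat the result as a rough guide.
    """
    z = NormalDist().inv_cdf
    alpha_adj = alpha / n_comparisons          # Bonferroni-adjusted level
    z_alpha = z(1 - alpha_adj / 2)             # two-sided critical value
    z_beta = z(power)                          # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Assumed scenario: 50% baseline, hoping to detect a lift to 55%.
for m in (1, 3, 5):
    print(m, "comparisons ->", n_per_arm(0.50, 0.55, n_comparisons=m), "per arm")
```

Comparing the output across values of `n_comparisons` shows how much the Bonferroni adjustment alone inflates the requirement, which is the number the other methods would need to beat.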

Best Answer

What is typically done (though it is easier said than done) is this:

  • do a pilot study which gives you an idea of the data you're handling
  • based on this (or, if a pilot study is not an option, on knowledge of the domain), create a data-generating model so you can sample data that 'looks like' the true data. Build it so that you control which observations are cases (in the sense that you want the tests to pick them up) and which are controls (in the sense that you do not want the tests to pick them up).
  • for each of a reasonable set of sample sizes, run 100 or 1000 simulations (i.e. create that many datasets, the more the merrier) and run the analysis on each one. Record how well each measure of interest performs, e.g. the false discovery rate.
  • you now have, for each sample size, an estimate of how well each of your measures will perform, so pick the sample size that gives the performance you need (if it is attainable at all), and be conservative about it (i.e. if you can, add another slab of observations)
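The steps above might be sketched along these lines. This is a rough illustration, not a definitive implementation: the data-generating model (independent binomial arms against a shared 50% control), the assumed treatment rates, the pooled z-test, and all function names are assumptions made for the example. It covers Bonferroni, Holm, and Benjamini-Hochberg only; Westfall-Young needs a resampling step and is left out. It uses only the standard library so that it runs anywhere; with numpy it would be considerably faster.

```python
import math
import random
from statistics import NormalDist


def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided p-value of a pooled z-test for a difference in proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def bonferroni(pvals, alpha):
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha):
    """Step-down: stop at the first ordered p-value that fails its threshold."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject


def benjamini_hochberg(pvals, alpha):
    """Step-up: reject everything up to the largest rank meeting its threshold."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


METHODS = {"bonferroni": bonferroni, "holm": holm, "bh": benjamini_hochberg}


def draw_successes(n, p, rng):
    """One binomial(n, p) draw, built from n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate(n_per_arm, true_rates, control_rate=0.5, alpha=0.05,
             n_sims=200, seed=0):
    """Estimate per-hypothesis power and FDR for each correction method."""
    rng = random.Random(seed)
    is_real = [r != control_rate for r in true_rates]   # 'cases' vs 'controls'
    n_real = max(sum(is_real), 1)
    hits_total = {name: 0 for name in METHODS}
    fdp_total = {name: 0.0 for name in METHODS}
    for _ in range(n_sims):
        x0 = draw_successes(n_per_arm, control_rate, rng)
        pvals = [two_prop_pvalue(draw_successes(n_per_arm, r, rng),
                                 n_per_arm, x0, n_per_arm)
                 for r in true_rates]
        for name, method in METHODS.items():
            rejected = method(pvals, alpha)
            hits = sum(r and t for r, t in zip(rejected, is_real))
            false_hits = sum(r and not t for r, t in zip(rejected, is_real))
            hits_total[name] += hits
            fdp_total[name] += false_hits / max(hits + false_hits, 1)
    return {name: {"power": hits_total[name] / (n_sims * n_real),
                   "fdr": fdp_total[name] / n_sims}
            for name in METHODS}


# Assumed scenario: two treatments with a real lift to 55%, three with none.
assumed_rates = [0.55, 0.55, 0.50, 0.50, 0.50]
for n in (500, 1000, 2000):
    print(n, simulate(n, assumed_rates))
```

Reading off the smallest `n` at which each method reaches the power you need gives a direct comparison of the sample sizes the methods require under your assumed model. Because all three methods see the same simulated p-values, differences between them are not blurred by extra simulation noise.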

The difficulty in the above obviously lies in creating that data-generating model. Again, when in doubt, make the 'true effects' small and add lots of noise to stay on the conservative side.

I'm pretty sure that the original article on FDR (Benjamini and Hochberg, 1995) contains examples where FWER control performs really badly, so it is to be expected that the sample-size calculations could come out very differently depending on which measure you control.