Solved – How to calculate power (or sample size) for a multiple comparison experiment

hypothesis-testing, multiple-comparisons, sample-size, statistical-power

I collected data on 20 groups (with 30 elements each). A multiple comparison procedure (pairwise t-tests with Holm correction) shows that, broadly, there are three sets of groups: a high set with 4 groups, a low set with 2 groups, and a middle set with the remaining 14 groups. Within each set the groups are not significantly different from one another, but they are significantly different from the groups in the other sets. (This is a simplification, because there are some other significant and non-significant results at the extremes of each set, but it lets me write a concise summary of the experiment both for you and for the readers of the paper.)
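For concreteness, here is a minimal R sketch of this kind of analysis on simulated data (the group structure, means, and standard deviation are invented purely for illustration):

```r
## Simulated stand-in for the real data: 20 groups of 30 observations,
## with 2 "low", 14 "middle" and 4 "high" groups (all numbers invented).
set.seed(1)
k <- 20; n <- 30
group <- factor(rep(paste0("g", 1:k), each = n))
means <- c(rep(0, 2), rep(1, 14), rep(2, 4))
y <- rnorm(k * n, mean = rep(means, each = n), sd = 1)

## Overall one-way ANOVA, then Holm-adjusted pairwise t-tests
anova(lm(y ~ group))
pairwise.t.test(y, group, p.adjust.method = "holm")
```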

If this result is going to be used for decision making, for example treating members of the groups in the middle set as equivalent, one must be sure that the results are "real" and not just an artifact of the small sample size.

Thus I need to calculate some measure of power (power = 1 minus the probability of accepting H0 when it is false) or some measure of sample size, to show either that a new experiment with a larger sample size is needed, or that the differences are indeed "probably true".

But statistical power of WHAT?

  1. It is not the power of the whole 20-group ANOVA, since that analysis rejected the null.
  2. Should I run the ANOVA on the 14 groups in the middle set and calculate the power of that? But that seems like it will overestimate the power (or underestimate the needed sample size), since the extreme groups in the middle set are "almost" different. (A rough sketch of options 2 and 3 appears after this list.)
  3. Should I calculate the power for the least significant pairwise t-test in the middle set (with a Bonferroni-corrected alpha)? But that will terribly underestimate the power, since the two most similar groups are very likely "really" not different.
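A rough sketch of what options 2 and 3 could look like with the pwr package; the effect sizes f and d below are placeholders, not values estimated from my data:

```r
library(pwr)

## Option 2: power of a one-way ANOVA on the 14 "middle" groups, n = 30 each
## (f is Cohen's f; 0.10 is a placeholder value)
pwr.anova.test(k = 14, n = 30, f = 0.10, sig.level = 0.05)

## Option 3: power of a single pairwise t-test at a Bonferroni-corrected alpha;
## with 20 groups there are choose(20, 2) = 190 pairwise comparisons
alpha_bonf <- 0.05 / choose(20, 2)
pwr.t.test(n = 30, d = 0.20, sig.level = alpha_bonf, type = "two.sample")
```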

Any ideas? Any references I can follow?

What I know so far:

  1. The R package pwr calculates the power or sample size for t-tests, one-way ANOVA, and other tests.
  2. "On the relative sample size required for multiple comparisons" by Witte, Elston, and Cardon discusses the use of Bonferroni-corrected alpha values in sample-size calculations for multiple comparisons (a sketch of that kind of calculation follows this list).
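A sketch of that kind of sample-size calculation (the target effect size d = 0.5 and 80% power are assumptions, not values taken from the reference):

```r
library(pwr)
m <- choose(20, 2)                    # number of pairwise comparisons
## Required n per group for a two-sample t-test at a Bonferroni-corrected alpha
pwr.t.test(d = 0.5, power = 0.80,
           sig.level = 0.05 / m,
           type = "two.sample")$n
```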

EDIT – Aug 2013

There has been some upvote movement on this question, so I decided to add some more information, or rather some clarification, regarding this topic.

I did not quite agree with the two answers posted; I do not think this is a data-mining/clustering problem. But probably I did not phrase the question correctly. The paper is now published, so I can not only point to it here but also discuss what I needed.

In the paper I (and colleagues) discuss the differences in productivity and citations among different computer science subareas, based on a random sample of 30 researchers in each area. The paper includes a compact letter display that shows the significant differences between any two of the 20 CS subareas. But I wanted to show significant equivalences between the areas, that is, when it is very likely that two areas have the same productivity or the same citations per paper, given the 30 sample points for each area.
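As an aside, a toy sketch of how such a compact letter display can be produced from pairwise p-values with the multcompView package (the area names and p-values are invented):

```r
library(multcompView)
## Named vector of pairwise p-values, names in "groupA-groupB" form
pvals <- c("areaA-areaB" = 0.001, "areaA-areaC" = 0.030, "areaB-areaC" = 0.600)
## Areas that share a letter are not significantly different
multcompLetters(pvals, threshold = 0.05)$Letters
```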

I know of equivalence tests (or Two One-Sided Tests – TOST) – there have been some discussions on CV about them, but nowhere did I see multiple equivalence tests!
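For reference, a minimal sketch of one pairwise equivalence test (TOST) in base R; the equivalence margin delta and the data are assumptions, and across all pairs the resulting p-values would still need a multiplicity adjustment (e.g., Holm):

```r
## TOST for one pair: both one-sided tests must reject for equivalence,
## so the TOST p-value is the larger of the two one-sided p-values.
tost <- function(x, y, delta) {
  p_lower <- t.test(x, y, mu = -delta, alternative = "greater")$p.value
  p_upper <- t.test(x, y, mu =  delta, alternative = "less")$p.value
  max(p_lower, p_upper)
}

set.seed(2)
areaA <- rnorm(30, mean = 10.0, sd = 2)   # invented data, 30 researchers per area
areaB <- rnorm(30, mean = 10.2, sd = 2)
tost(areaA, areaB, delta = 1)             # small p-value -> evidence of equivalence
```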

My idea to use power came from its definition: power = 1 minus the probability of accepting H0 when it is false, which is exactly what I need to state that two areas have the same productivity. I make the statement that they have the same productivity (H0), and that statement is true with "power" as the confidence level!

I still do not know how to do that, and the paper has no statement of probable equivalence between some CS areas, which is in fact the more interesting result!

I would again appreciate any comments or help.

Best Answer

If you have already done the experiment, then there is little point in doing any power analyses. Where the P-values are small, the power for the observed effect size and variability was large enough. Where the P-values are large, the power was small for the observed effect size and variability. Power analysis is useful for planning experiments, but not useful after the fact. See this paper by Hoenig & Heisey: http://www.tandfonline.com/doi/abs/10.1198/000313001300339897#preview
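A small sketch of that point (all numbers invented): for a fixed design, "observed power" computed from the observed effect size is a monotone transformation of the P-value, so it carries no information beyond the P-value itself:

```r
library(pwr)
## Hypothetical observed effect sizes for a two-sample comparison, n = 30 per group
d_obs <- c(0.2, 0.5, 0.8)
sapply(d_obs, function(d) {
  c(p.value        = 2 * pt(-abs(d) * sqrt(30 / 2), df = 58),  # approximate P-value
    observed.power = pwr.t.test(n = 30, d = d, sig.level = 0.05)$power)
})
## Larger observed d -> smaller P-value and larger "observed power", in lockstep
```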

Your desire for a power analysis appears to be based on the statement "one must be sure that the results are 'real' and not just due to the small sample size", so it is useful to consider it closely. First, statistical analysis cannot tell you about the reality of a result – something that you probably know, given that you put 'real' in quotes. Second, you imply that a small sample is more likely to yield a false positive result, when in reality a small sample is exactly as likely to do so as a large sample. A small sample is more likely to yield a false negative result.

If you want to be confident that the results yield reliable conclusions then you have to consider their nature in light of what is known about the system and, ideally, replicate the parts of the study that are most interesting or surprising. (I acknowledge that a well-judged statistical analysis is more helpful here than a poorly judged one: see Julien Sturnemann's answer for some suggestions.)
