Solved – How to calculate power (or sample size) for a multiple comparison experiment

hypothesis-testing, multiple-comparisons, sample-size, statistical-power

I collected data on 20 groups (with 30 elements each). A multiple comparison procedure (pairwise t-tests with Holm correction) shows that, broadly, there are three sets of groups: a high set with 4 groups, a low set with 2 groups, and a middle set with the remaining 14 groups. Within each set the groups are not significantly different from one another, but they are significantly different from the groups in the other sets. (This is a simplification, because there are some other significant and non-significant results at the extremes of each set, but it lets me write a concise summary of the experiment both for you and for the readers of the paper.)
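For concreteness, here is a minimal R sketch of this kind of analysis on simulated data (the group structure, means, and standard deviation are invented purely for illustration):

```r
## Simulated stand-in for the real data: 20 groups of 30 observations,
## with 2 "low", 14 "middle" and 4 "high" groups (all numbers invented).
set.seed(1)
k <- 20; n <- 30
group <- factor(rep(paste0("g", 1:k), each = n))
means <- c(rep(0, 2), rep(1, 14), rep(2, 4))
y <- rnorm(k * n, mean = rep(means, each = n), sd = 1)

## Overall one-way ANOVA, then Holm-adjusted pairwise t-tests
anova(lm(y ~ group))
pairwise.t.test(y, group, p.adjust.method = "holm")
```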

If this result is going to be used for decision making, for example treating members of the groups in the middle set as equivalent, one must be sure that the results are "real" and not just an artifact of the small sample size.

Thus I need to calculate some measure of power (power = 1 minus the probability of accepting H0 when it is false) or some measure of sample size, to show either that a new experiment with a larger sample size is needed, or that the differences are indeed "probably true".

But statistical power of WHAT?

  1. It is not the power of the whole 20-group ANOVA, since that analysis rejected the null.
  2. Should I run the ANOVA on the 14 groups in the middle set and calculate the power of that? But that seems like it will overestimate the power (or underestimate the needed sample size), since the extreme groups in the middle set are "almost" different. (A rough sketch of options 2 and 3 appears after this list.)
  3. Should I calculate the power for the least significant pairwise t-test in the middle set (with a Bonferroni-corrected alpha)? But that will terribly underestimate the power, since the two most similar groups are very likely "really" not different.
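A rough sketch of what options 2 and 3 could look like with the pwr package; the effect sizes f and d below are placeholders, not values estimated from my data:

```r
library(pwr)

## Option 2: power of a one-way ANOVA on the 14 "middle" groups, n = 30 each
## (f is Cohen's f; 0.10 is a placeholder value)
pwr.anova.test(k = 14, n = 30, f = 0.10, sig.level = 0.05)

## Option 3: power of a single pairwise t-test at a Bonferroni-corrected alpha;
## with 20 groups there are choose(20, 2) = 190 pairwise comparisons
alpha_bonf <- 0.05 / choose(20, 2)
pwr.t.test(n = 30, d = 0.20, sig.level = alpha_bonf, type = "two.sample")
```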

Any ideas? Any references I can follow?

What I know so far:

  1. The R package pwr calculates the power or sample size for t-tests, one-way ANOVA, and other tests.
  2. "On the relative sample size required for multiple comparisons" by Witte, Elston, and Cardon discusses the use of Bonferroni-corrected alpha values in sample-size calculations for multiple comparisons (a sketch of that kind of calculation follows this list).
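A sketch of that kind of sample-size calculation (the target effect size d = 0.5 and 80% power are assumptions, not values taken from the reference):

```r
library(pwr)
m <- choose(20, 2)                    # number of pairwise comparisons
## Required n per group for a two-sample t-test at a Bonferroni-corrected alpha
pwr.t.test(d = 0.5, power = 0.80,
           sig.level = 0.05 / m,
           type = "two.sample")$n
```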

EDIT – Aug 2013

There has been some upvote movement on this question, so I decided to add some more information, or rather some clarification, regarding this topic.

I did not quite agree with the two answers posted; I do not think this is a data-mining/clustering problem. But probably I did not phrase the question correctly. The paper is now published, so I can not only point to it here but also discuss what I needed.

In the paper I (and colleagues) discuss the differences in productivity and citations among different computer science subareas, based on a random sample of 30 researchers in each area. The paper includes a compact letter display that shows the significant differences between any two of the 20 CS subareas. But I wanted to show significant equivalences between the areas, that is, when it is very likely that two areas have the same productivity or the same citations per paper, given the 30 sample points for each area.
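As an aside, a toy sketch of how such a compact letter display can be produced from pairwise p-values with the multcompView package (the area names and p-values are invented):

```r
library(multcompView)
## Named vector of pairwise p-values, names in "groupA-groupB" form
pvals <- c("areaA-areaB" = 0.001, "areaA-areaC" = 0.030, "areaB-areaC" = 0.600)
## Areas that share a letter are not significantly different
multcompLetters(pvals, threshold = 0.05)$Letters
```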

I know of equivalence tests (or Two One-Sided Tests – TOST) – there have been some discussions on CV about them, but nowhere did I see multiple equivalence tests!
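For reference, a minimal sketch of one pairwise equivalence test (TOST) in base R; the equivalence margin delta and the data are assumptions, and across all pairs the resulting p-values would still need a multiplicity adjustment (e.g., Holm):

```r
## TOST for one pair: both one-sided tests must reject for equivalence,
## so the TOST p-value is the larger of the two one-sided p-values.
tost <- function(x, y, delta) {
  p_lower <- t.test(x, y, mu = -delta, alternative = "greater")$p.value
  p_upper <- t.test(x, y, mu =  delta, alternative = "less")$p.value
  max(p_lower, p_upper)
}

set.seed(2)
areaA <- rnorm(30, mean = 10.0, sd = 2)   # invented data, 30 researchers per area
areaB <- rnorm(30, mean = 10.2, sd = 2)
tost(areaA, areaB, delta = 1)             # small p-value -> evidence of equivalence
```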

My idea to use power came from its definition: power = 1 minus the probability of accepting H0 when it is false, which is exactly what I need to state that two areas have the same productivity. I make the statement that they have the same productivity (H0), and that statement is true with "power" as the confidence level!

I still do not know how to do that, and the paper has no statement of probable equivalence between some CS areas, which is in fact the more interesting result!

I would again appreciate any comments or help.

Best Answer

If you have already done the experiment, then there is little point in doing any power analyses. Where the P-values are small, the power for the observed effect size and variability was large enough. Where the P-values are large, the power was small for the observed effect size and variability. Power analysis is useful for planning experiments, but not useful after the fact. See this paper by Hoenig & Heisey: http://www.tandfonline.com/doi/abs/10.1198/000313001300339897#preview
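A small sketch of that point (all numbers invented): for a fixed design, "observed power" computed from the observed effect size is a monotone transformation of the P-value, so it carries no information beyond the P-value itself:

```r
library(pwr)
## Hypothetical observed effect sizes for a two-sample comparison, n = 30 per group
d_obs <- c(0.2, 0.5, 0.8)
sapply(d_obs, function(d) {
  c(p.value        = 2 * pt(-abs(d) * sqrt(30 / 2), df = 58),  # approximate P-value
    observed.power = pwr.t.test(n = 30, d = d, sig.level = 0.05)$power)
})
## Larger observed d -> smaller P-value and larger "observed power", in lockstep
```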

Your desire for a power analysis appears to be based on the statement "one must be sure that the results are 'real' and not just due to the small sample size", so it is useful to consider it closely. First, statistical analysis cannot tell you about the reality of a result – something that you probably know, given that you put 'real' in quotes. Second, you imply that a small sample is more likely to yield a false positive result, when in reality a small sample is exactly as likely to do so as a large sample. A small sample is more likely to yield a false negative result.

If you want to be confident that the results yield reliable conclusions then you have to consider their nature in light of what is known about the system and, ideally, replicate the parts of the study that are most interesting or surprising. (I acknowledge that a well-judged statistical analysis is more helpful here than a poorly judged one: see Julien Sturnemann's answer for some suggestions.)
