Solved – What does a post hoc power analysis say about the significance of the study results?

post-hoc, statistical-power

I originally posted this over at AskMetafilter, and a commenter suggested I ask it here.

I work for a dietary supplement company that also makes skin care products, some of which are tested clinically. Now there is talk of repeating some of those clinical tests in another region of the world where the products will be sold, because the marketing department thinks that would be good. The question came up: how do you decide how many subjects to include?

The answer, I learn, is statistical power. You include at least as many subjects as are needed to give you an appropriate power, say 80%, given your chosen significance level alpha, your expected effect size, and the type of statistical test you are performing. But what if p is small, lower than alpha, while power is also low? Does that invalidate the results?
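In R, for example, that kind of sample size calculation can be sketched with power.t.test; the delta and sd below are made-up placeholders (the expected mean within-pair change and the SD of those changes), not values from any study:

> # How many pairs are needed for 80% power in a two-tailed paired t test?
> power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
+              type = "paired", alternative = "two.sided")

Leaving n out of the call makes the function solve for it; for these placeholder values it returns about 34 pairs.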

For example, one of the studies they want to repeat used a two-tailed paired t test to compare before- and after-treatment means for a measurement. Alpha was 0.05 and the sample size was 30. The pooled SD for the two data sets was 0.688. After the fact, I calculated a Cohen's d of 0.494. All this gives a power of less than 50%, which means the study was underpowered. At the same time, p was 0.000004.

I can tell the people at work that, when the study is repeated, we are going to need more subjects or we risk missing the effect we saw in the first trial – but what can we conclude about that first trial? Power was low, yet p was much lower than alpha. Are the results no good? Can we still trust that p? Or, what is also possible, am I completely confused, and none of this works the way I think it does?

Thanks for any help you can provide!

Best Answer

There are many related posts you may want to look at.

In general, I like to think of power as the resolution of a photo. So we might ask: How much resolution do you need to see the picture clearly? Or: How much data do you need to see whether there are differences?

Looking at an image with coarse resolution (low power), it is hard to tell what you are seeing – though it is easier if the effect size is large or the image displays a very clear pattern. In conducting power analyses, I think it is usually best to fix the effect size at some meaningful value – the minimum effect size worth detecting.

The question then becomes: how many observations do I need to reliably detect this difference? This is like using the camera on my phone vs. an expensive DSLR camera. The more expensive camera takes higher-resolution pictures and gives me finer detail – but chances are I can tell what I am looking at with the phone camera. In statistics, you are calculating the number of observations needed to achieve the specified power and alpha given that exact effect size. If the actual effect size is greater, then you have more power than you planned for.
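To make this concrete, here is a quick check using the same hypothetical numbers as the two-sample calculation at the end of this answer, but with a larger (made-up) true effect of 0.70 instead of the 0.49 the design targeted:

> # Designed around delta = 0.49; if the true delta is 0.70, the same
> # n = 15 per group gives more power (all values here are hypothetical)
> power.t.test(n = 15, delta = 0.70, sd = 0.46, sig.level = 0.05, type = "two.sample")$power

The power reported comes out well above the 0.80 the original design aimed for.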

I would suggest re-running your sample size calculations, but using the minimum effect size worth detecting – NOT the effect size you have already observed. This is because you should expect variation in the effect size, and you may want to be able to detect an effect smaller than the one you have already observed.
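As an illustration, suppose the minimum effect worth detecting were a standardized d of 0.3 (a made-up threshold, smaller than the 0.494 you observed). Solving for the number of pairs in a paired design:

> # Hypothetical: how many pairs to detect d = 0.3 with 80% power?
> # delta = 0.3 with sd = 1 expresses the effect in standardized units
> power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80, type = "paired")

which calls for roughly 90 pairs – far more than the 30 subjects in the original study.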

Lastly, note that power depends heavily on the standard deviation. If the standard deviation is sufficiently small relative to the difference of interest, then your initial study may have been sufficiently powered.

In R (an illustrative two-sample calculation with hypothetical values):

> power.t.test(n = 15, delta = 0.49, sd = 0.46, sig.level = 0.05, type = "two.sample")

#      Two-sample t test power calculation 
#
#              n = 15
#          delta = 0.49
#             sd = 0.46
#      sig.level = 0.05
#          power = 0.8039799
#    alternative = two.sided
#
# NOTE: n is number in *each* group
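Since your study was a before/after comparison, the paired analogue – solving for the number of pairs with the same hypothetical delta and sd – would be:

> # Paired version: leave n unspecified so the function solves for it
> power.t.test(delta = 0.49, sd = 0.46, sig.level = 0.05, power = 0.80, type = "paired")

With these placeholder values the standardized effect (0.49/0.46 ≈ 1.07) is large, so the required n comes out to only around ten pairs; with a realistic SD of the within-subject differences plugged in, the answer could change substantially.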