There are many related posts you may want to look at.
In general, I like to think of power as the resolution of a photo. So we might ask: how much resolution do you need to see the picture clearly? Or, how much data do you need to see whether there are differences?
Looking at an image with coarse resolution (low power), it is hard to tell what you are seeing. But it is easier if the effect size is large or the image displays a very clear pattern. In conducting power analyses, I think it is usually best to fix the effect size at some meaningful value – the minimum effect size worth detecting.
The question becomes: how many observations do I need to reliably detect this difference? This is like using the camera on my phone vs. an expensive DSLR camera. The more expensive camera might take higher resolution pictures and give me more fine detail – but chances are I can tell what I am looking at with the camera on my phone. In statistics, you are calculating the number of observations needed to achieve the specified power and alpha given that exact effect size. But if the actual effect size is larger, then you have more power than planned.
I would suggest re-running your sample size calculations, but using the minimum effect size worth detecting – NOT the effect size you have already observed. This is because you should expect variation in the effect size, and you may want to be able to detect an effect smaller than the one you have already observed.
Lastly, you did not include the standard deviation in your original post. If the standard deviation is sufficiently small, then your initial study may have been sufficiently powered.
In R:
> power.t.test(n = 15, delta = 0.49, sd = 0.46, sig.level = 0.05, type="two.sample")
# Two-sample t test power calculation
#
# n = 15
# delta = 0.49
# sd = 0.46
# sig.level = 0.05
# power = 0.8039799
# alternative = two.sided
#
# NOTE: n is number in *each* group
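If you instead fix the minimum effect size worth detecting and leave n out of the call, power.t.test solves for the required sample size per group. As a rough sketch (the delta of 0.30 below is just an assumed minimum effect of interest, not a value from your post):
> power.t.test(delta = 0.30, sd = 0.46, sig.level = 0.05, power = 0.80, type = "two.sample")
# For these inputs, n comes out at roughly 38
# NOTE: n is number in *each* group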
In order to aggregate the results of multiple studies, you should instead think of making your results accessible for meta-analyses. A meta-analysis considers the data of each study, or at least its estimates, models study effects, and comes to a systematic conclusion by forming a kind of large virtual study out of many small single studies. The individual $p$-values, fictitious priors and planned power are not important inputs for meta-analyses.
Instead, it is important to have all studies accessible, regardless of power levels or significant results. In fact, the bad habit of publishing only significant results and concealing non-significant ones leads to publication bias and corrupts the overall record of scientific results.
So the individual researcher should conduct the study in a reproducible way, keep all the records, and log all experimental procedures even if such details are not asked for by the publishing journals – and should not worry too much about low power. Even a non-informative result (= null hypothesis not rejected) adds estimates for further studies, as long as the data themselves are of sufficient quality.
If you tried to aggregate findings only by $p$-values and some FDR considerations, you would be taking the wrong approach, because a study with a larger sample size, smaller variances and better-controlled confounders is of course more reliable than other studies. Yet they all produce $p$-values, and the best FDR procedure applied to those $p$-values can never make up for such quality disparities.
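As a minimal sketch of what such pooling looks like, here is an inverse-variance (fixed-effect) combination of several study estimates in base R. The estimates and standard errors are made-up numbers for illustration, and a real meta-analysis would usually also model between-study (random) effects:
# Hypothetical per-study estimates (e.g. mean differences) and their standard errors
est <- c(0.49, 0.31, 0.62, 0.12)
se  <- c(0.17, 0.24, 0.20, 0.15)
# Inverse-variance weights: more precise studies count for more
w <- 1 / se^2
pooled    <- sum(w * est) / sum(w)   # combined estimate from the "large virtual study"
pooled_se <- sqrt(1 / sum(w))        # standard error of the combined estimate
pooled + c(-1, 1) * 1.96 * pooled_se # approximate 95% confidence interval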
Best Answer
From a quick skim, it seems like they're basically taking a Bayesian viewpoint and computing a particular probability ($P(H_0 \text{ true} \mid \text{reject})$, if I understood what they were getting at) that they argue must go up as the sample size goes down. If that's the claim, then as far as it goes, it's valid: the denominator in Bayes' rule must decrease as the sample size goes down, while the significance level and $P(H_0 \text{ true})$ are presumably fixed.
A frequentist would argue that their rate of rejection given that $H_0$ is true is fixed, and they'd likely say that's what they care about.
On the gripping hand, in the overwhelming majority of studies, the true overall rate of false positives must be effectively zero at every sample size, since in most circumstances nulls are simply not exactly true.
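To make the Bayesian-flavoured calculation concrete, here is a small sketch of $P(H_0 \text{ true} \mid \text{reject})$ as a function of sample size; the prior $P(H_0) = 0.5$ and the effect size are arbitrary illustrative choices, not anything the paper specifies:
# P(H0 | reject) = alpha * P(H0) / (alpha * P(H0) + power * (1 - P(H0)))
p_h0  <- 0.5    # assumed prior probability that H0 is true (illustrative only)
alpha <- 0.05
post_h0_given_reject <- function(n, delta = 0.49, sd = 0.46) {
  pow <- power.t.test(n = n, delta = delta, sd = sd,
                      sig.level = alpha, type = "two.sample")$power
  alpha * p_h0 / (alpha * p_h0 + pow * (1 - p_h0))
}
sapply(c(5, 15, 50), post_h0_given_reject)
# The posterior probability of H0 given a rejection rises as n (and hence power) falls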
I guess it comes down to what probability you want.