Underpowered Studies – Increased Likelihood of False Positives in Hypothesis Testing

false-discovery-rate, hypothesis-testing, statistical-power

This question has been asked before here and here, but I don't think the answers address it directly.

Do underpowered studies have increased likelihood of false positives? Some news articles make this assertion. For example:

Low statistical power is bad news. Underpowered studies are more likely to miss genuine effects, and as a group they're more likely to include a higher proportion of false positives — that is, effects that reach statistical significance even though they are not real.

As I understand it, the power of a test can be increased by:

  • increasing the sample size
  • having a larger effect size
  • increasing the significance level

Assuming we don't want to change the significance level, I believe the quote above refers to changing the sample size. However, I don't see how decreasing the sample size should increase the number of false positives. Put simply, reducing the power of a study increases the chance of false negatives, which corresponds to the probability:

$$P(\text{failure to reject }H_{0}|H_{0}\text{ is false})$$

False positives, on the other hand, correspond to the probability:

$$P(\text{reject }H_{0}|H_{0}\text{ is true})$$

These are different quantities because the conditioning events are different. Power is (inversely) related to false negatives but not to false positives. Am I missing something?

Best Answer

You are correct that sample size affects power (i.e., 1 − type II error) but not type I error. It is a common misunderstanding that a p-value as such (correctly interpreted) is less reliable or valid when the sample size is small; the very entertaining article by Friston (2012) has a funny take on that [1].
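To see why the type I error rate does not depend on sample size, here is a minimal simulation sketch (my own illustration; the sample sizes, number of repetitions, and choice of a two-sample t-test are arbitrary assumptions, not something from the question or the quoted article): with the null hypothesis true, the rejection rate stays near the nominal 5% whatever n is.

```python
# Sketch: under a true null, the false-positive rate of a t-test
# stays near alpha regardless of sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims = 0.05, 20_000

for n in (10, 50, 500):
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n)   # both groups from the same distribution,
        b = rng.normal(0, 1, n)   # so H0 is true by construction
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    print(f"n = {n:3d}  type I error rate ≈ {rejections / n_sims:.3f}")
# All three rates hover around 0.05, independent of n.
```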

That being said, the issues with underpowered studies are real, and I would say the quote is largely correct, just a bit imprecise in its wording.

The basic problem with underpowered studies is that, although the rate of false positives (type I error) in hypothesis tests is fixed, the rate of true positives (power) goes down. Hence, a positive (i.e., significant) result is less likely to be a true positive in an underpowered study. This idea is captured by the false discovery rate [2]; see also [3]. This seems to be what the quote refers to.
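To make that concrete, here is a small numerical sketch of the argument (the 10% prior probability of a true effect and the two power levels are assumptions chosen purely for illustration): by Bayes' rule, the probability that a significant result reflects a real effect drops as power drops.

```python
# Sketch: positive predictive value of a significant result,
# PPV = (power * prior) / (power * prior + alpha * (1 - prior)).
def ppv(power, alpha=0.05, prior=0.10):
    """Probability that a significant finding is a true positive."""
    true_pos = power * prior          # true effects that reach significance
    false_pos = alpha * (1 - prior)   # null effects that reach significance
    return true_pos / (true_pos + false_pos)

print(f"well powered  (power = 0.8): PPV ≈ {ppv(0.8):.2f}")   # ~0.64
print(f"underpowered (power = 0.2): PPV ≈ {ppv(0.2):.2f}")   # ~0.31
```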

An additional issue often raised regarding underpowered studies is that they lead to overestimated effect sizes. The reasons are that (a) with lower power, your estimates of the true effects become more variable (stochastic) around their true value, and (b) only the largest of those estimates pass the significance filter when power is low. One should add, though, that this is a reporting problem that could easily be fixed by discussing and reporting all effects, not only the significant ones.
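The following simulation sketch illustrates this significance filter (the true effect size of 0.2 and the two sample sizes are my own arbitrary assumptions): among studies that reach significance, the average estimated effect is inflated when power is low, and close to the truth when power is high.

```python
# Sketch: effect-size inflation among significant results ("winner's curse").
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, alpha, n_sims = 0.2, 0.05, 20_000

for n in (20, 400):  # small n -> low power, large n -> high power
    significant = []
    for _ in range(n_sims):
        x = rng.normal(true_effect, 1, n)
        t, p = stats.ttest_1samp(x, 0)
        if p < alpha and t > 0:
            significant.append(x.mean())
    print(f"n = {n:3d}  mean estimate among significant results ≈ "
          f"{np.mean(significant):.2f}  (true effect = {true_effect})")
# With n = 20 the significant estimates average far above 0.2;
# with n = 400 they sit close to the true value.
```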

Finally, an important practical issue with underpowered studies is that low power aggravates statistical issues (e.g., bias of estimators) as well as the temptation to play around with variables and similar p-hacking tactics. Exploiting these "researcher degrees of freedom" is most effective when power is low, and this can increase the type I error after all; see, e.g., [4].
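As a rough sketch of one such tactic (optional stopping, one of the degrees of freedom discussed in [4]; the maximum sample size and peeking schedule here are arbitrary choices of mine): repeatedly checking the p-value and stopping at the first significant result inflates the type I error well above the nominal 5%, even though the null is true throughout.

```python
# Sketch: optional stopping under a true null inflates the type I error rate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n_sims = 0.05, 5_000
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(0, 1, 100)   # H0 is true: both groups share
    b = rng.normal(0, 1, 100)   # the same distribution
    # Peek after every additional 10 observations per group and stop
    # as soon as the test is "significant".
    for n in range(20, 101, 10):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
            false_positives += 1
            break

print(f"type I error with optional stopping ≈ {false_positives / n_sims:.2f}")
# Noticeably above the nominal 0.05.
```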

For all these reasons, I would indeed be skeptical about the results of an underpowered study.

[1] Friston, K. (2012) Ten ironic rules for non-statistical reviewers. NeuroImage, 61, 1300-1310.

[2] https://en.wikipedia.org/wiki/False_discovery_rate

[3] Button, K. S.; Ioannidis, J. P. A.; Mokrysz, C.; Nosek, B. A.; Flint, J.; Robinson, E. S. J. & Munafo, M. R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci., 14, 365-376.

[4] Simmons, J. P.; Nelson, L. D. & Simonsohn, U. (2011) False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychol Sci., 22, 1359-1366.