How should an individual researcher think about the false discovery rate?

Tags: false-discovery-rate, p-value, publication-bias, statistical-significance

I've been trying to wrap my head around how the False Discovery Rate (FDR) should inform the conclusions of the individual researcher. For example, if your study is underpowered, should you discount your results even if they're significant at $\alpha = .05$? Note: I'm talking about the FDR in the context of examining the results of multiple studies in aggregate, not as a method for multiple test corrections.

Under the (perhaps generous) assumption that $\sim .5$ of the hypotheses tested are actually true, the FDR is a function of both the type I and type II error rates as follows:

$$\text{FDR} = \frac{\alpha}{\alpha+1-\beta}.$$

It stands to reason that if a study is sufficiently underpowered, we should not trust the results, even if they are significant, as much as we would those of an adequately powered study. So, as some statisticians would say, there are circumstances under which, "in the long run", we might publish many significant results that are false if we follow the traditional guidelines. If a body of research is characterized by consistently underpowered studies (e.g., the candidate gene $\times$ environment interaction literature of the previous decade), even replicated significant findings can be suspect.
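To make this concrete, here is a minimal R sketch (the power levels are purely illustrative) of how the FDR from the formula above behaves as power drops, holding $\alpha = .05$ and the 50/50 prior fixed:

```r
# FDR = alpha / (alpha + power), assuming half of tested hypotheses are true
alpha <- 0.05
power <- c(0.95, 0.80, 0.50, 0.20, 0.10)   # hypothetical power levels
fdr   <- alpha / (alpha + power)
round(data.frame(power, fdr), 3)
# FDR climbs from .05 at 95% power to about .33 at 10% power
```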

Applying the R packages extrafont, ggplot2, and xkcd, I think this might be usefully conceptualized as an issue of perspective:
[xkcd-style figure, two panels titled "A significant result..." and "Not so sure..."]

Given this information, what should an individual researcher do next? If I have a guess of the size of the effect I'm studying (and therefore an estimate of $1-\beta$, given my sample size), should I adjust my $\alpha$ level until the FDR is .05? Or should I publish results at the $\alpha = .05$ level even if my studies are underpowered, and leave consideration of the FDR to consumers of the literature?
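For the first option, here is a back-of-the-envelope R sketch (again assuming the 50/50 prior and a guessed power) of what "adjusting $\alpha$ until the FDR is .05" would amount to, obtained by inverting the formula above:

```r
# Inverting FDR = alpha / (alpha + power) gives alpha = FDR * power / (1 - FDR)
target_fdr <- 0.05
power      <- c(0.80, 0.50, 0.20)          # guessed power from the expected effect size
alpha_adj  <- target_fdr * power / (1 - target_fdr)
round(alpha_adj, 3)
# roughly .042, .026, .011 -- the lower the power, the stricter alpha would have to be
```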

I know this is a topic that has been discussed frequently, both on this site and in the statistics literature, but I can't seem to find a consensus of opinion on this issue.


EDIT: In response to @amoeba's comment, the FDR can be derived from the standard type I/type II error rate contingency table (pardon its ugliness):

|                             | Finding is significant | Finding is not significant |
|:----------------------------|:-----------------------|:---------------------------|
| Finding is false in reality | $\alpha$               | $1 - \alpha$               |
| Finding is true in reality  | $1 - \beta$            | $\beta$                    |

So, if we are presented with a significant finding (column 1), the chance that it is false in reality is $\alpha$ divided by the sum of that column; this works because the 50/50 prior above gives the two rows equal weight.

But yes, we can modify our definition of the FDR to reflect the (prior) probability that a given hypothesis is true, though study power $(1-\beta)$ still plays a role:

$$\text{FDR} = \frac{\alpha \cdot (1- \text{prior})}{\alpha \cdot (1- \text{prior}) + (1-\beta) \cdot \text{prior}}$$
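A small R function makes the interplay between $\alpha$, power, and the prior visible; the values below are purely illustrative:

```r
# Prior-weighted FDR from the formula above
fdr <- function(alpha, power, prior) {
  alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)
}
fdr(alpha = 0.05, power = 0.80, prior = 0.5)   # ~0.059
fdr(alpha = 0.05, power = 0.80, prior = 0.1)   # ~0.36: mostly false hypotheses tested
fdr(alpha = 0.05, power = 0.20, prior = 0.1)   # ~0.69: low prior AND low power
```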

Best Answer

To aggregate the results of multiple studies, you should instead think about making your results accessible for meta-analyses. A meta-analysis considers the data of each study, or at least its estimates, models study effects, and comes to a systematic conclusion by forming a kind of large virtual study out of many small individual studies. Individual $p$-values, fictitious priors, and planned power are not important inputs for meta-analyses.

What matters instead is that all studies are accessible, regardless of their power or whether their results were significant. In fact, the bad habit of publishing only significant results and concealing non-significant ones leads to publication bias and corrupts the overall record of scientific results.

So the individual researcher should conduct the study in a reproducible way, keep all records, and log all experimental procedures, even if such details are not asked for by the publishing journals. They should not worry too much about low power: even a noninformative result (i.e., the null hypothesis is not rejected) adds estimates for further studies, as long as the data themselves are of sufficient quality.

If you try to aggregate findings only by $p$-values and FDR considerations, you are taking the wrong path: a study with a larger sample size, smaller variance, and better-controlled confounders is of course more reliable than others, yet all studies produce $p$-values, and even the best FDR procedure applied to those $p$-values can never make up for such disparities in quality.
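To illustrate the point, here is a minimal sketch of fixed-effect inverse-variance pooling in R, with made-up effect estimates and standard errors (not a substitute for a proper meta-analysis package): each study is weighted by its precision, so small or noisy studies still contribute, they just count for less.

```r
# Fixed-effect, inverse-variance pooling of per-study effect estimates
yi <- c(0.42, 0.10, 0.35, -0.05)   # effect estimates (made up for illustration)
se <- c(0.30, 0.15, 0.25,  0.20)   # standard errors  (made up for illustration)

w         <- 1 / se^2              # inverse-variance weights
pooled    <- sum(w * yi) / sum(w)  # pooled effect estimate
pooled_se <- sqrt(1 / sum(w))      # standard error of the pooled estimate
c(estimate = pooled, se = pooled_se)
```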
