Solved – Family-wise error boundary: Does re-using data sets in different studies of independent questions lead to multiple testing problems?

hypothesis-testing, multiple-comparisons

If a team of researchers performs multiple (hypothesis) tests on a given data set, there is a volume of literature asserting that they should use some form of correction for multiple testing (Bonferroni, etc.), even if the tests are independent. My question is this: does the same logic apply to multiple teams testing hypotheses on the same data set? Said another way, where is the boundary for the family-wise error calculations? Should researchers be limited to reusing data sets for exploration only?

Best Answer

I disagree strongly with @fcoppens' leap from recognizing the importance of multiple-hypothesis correction within a single investigation to claiming that "By the same reasoning, the same holds if several teams perform these tests."

There is no question that the more studies are performed and the more hypotheses are tested, the more Type I errors will occur. But I think there is confusion here over the meaning of "family-wise error" rates and how they apply in actual scientific work.
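To put a number on the uncontroversial part of this, here is a minimal sketch (in Python; the per-test level of 0.05 is an arbitrary choice, and the tests are assumed independent with all null hypotheses true) of how fast the family-wise error rate grows with the number of tests:

```python
# Family-wise error rate for m independent tests of true null hypotheses,
# each carried out at significance level alpha: P(at least one Type I error).
alpha = 0.05  # per-test level, chosen purely for illustration

for m in (1, 5, 10, 20, 50, 100):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:3d} tests  ->  FWER = {fwer:.3f}")
```

At 50 such tests the chance of at least one false positive is already above 90 percent, which is why the within-study version of this concern is uncontroversial. The dispute is over where the "family" of tests ends.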

First, remember that multiple-testing corrections typically arose in post-hoc comparisons for which there were no pre-formulated hypotheses. It is not at all clear that the same corrections are required when there is a small pre-defined set of hypotheses.

Second, the "scientific truth" of an individual publication does not depend on the truth of each individual statement within the publication. A well-designed study approaches an overall scientific (as opposed to statistical) hypothesis from many different perspectives, and puts together different types of results to evaluate the scientific hypothesis. Each individual result may be evaluated by a statistical test.

By the argument from @fcoppens, however, if even one of those individual statistical tests makes a Type I error, that leads to a "false belief of 'scientific truth'". This is simply wrong.

The "scientific truth" of the scientific hypothesis in a publication, as opposed to the validity of an individual statistical test, generally comes from a combination of different types of evidence. Insistence on multiple types of evidence makes the validity of a scientific hypothesis robust to the individual mistakes that inevitably occur. As I look back on my 50 or so scientific publications, I would be hard pressed to find any that remains so flawless in every detail as @fcoppens seems to insist upon. Yet I am similarly hard pressed to find any where the scientific hypothesis was outright wrong. Incomplete, perhaps; made irrelevant by later developments in the field, certainly. But not "wrong" in the context of the state of scientific knowledge at the time.

Third, the argument ignores the cost of Type II errors. A Type II error can close off an entire field of promising scientific inquiry. If the recommendations of @fcoppens were followed, Type II error rates would escalate massively, to the detriment of the scientific enterprise.
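To see what that escalation looks like, here is a minimal sketch of the flip side: the power to detect a real effect under an increasingly aggressive Bonferroni correction. The test (a two-sided one-sample z-test), the standardized effect of 0.5, and the sample size of 30 are purely illustrative assumptions:

```python
# Sketch: power of a two-sided one-sample z-test for a real effect,
# at the nominal level alpha versus Bonferroni-adjusted levels alpha/m.
# The effect size, sample size, and alpha are illustrative assumptions.
from scipy.stats import norm

alpha, effect, n = 0.05, 0.5, 30       # standardized effect d, sample size n
z_effect = effect * n ** 0.5           # mean of the z statistic under the alternative

for m in (1, 10, 100):
    crit = norm.ppf(1 - (alpha / m) / 2)   # two-sided critical value after correction
    power = (1 - norm.cdf(crit - z_effect)) + norm.cdf(-crit - z_effect)
    print(f"m = {m:3d}  ->  power = {power:.2f}")
```

Under these assumptions, power falls from roughly 0.78 with no correction to roughly 0.23 when the "family" is taken to contain 100 tests; treating every analysis ever run on a shared data set as one family would drive the divisor far higher still.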

Finally, the recommendation is impossible to follow in practice. If I analyze a set of publicly available data, I may have no way of knowing whether anyone else has used it, or for what purpose. I have no way of correcting for anyone else's hypothesis tests. And as I argue above, I shouldn't have to.
