If a team of researchers performs multiple (hypothesis) tests on a given data set, there is a volume of literature asserting that they should use some form of correction for multiple testing (Bonferroni, etc.), even if the tests are independent. My question is this: does the same logic apply to multiple teams testing hypotheses on the same data set? Put another way, where is the boundary for the family-wise error calculations? Should researchers be limited to reusing data sets for exploration only?
Solved – Family-wise error boundary: Does re-using data sets for different studies of independent questions lead to multiple-testing problems?
hypothesis-testing, multiple-comparisons
Related Solutions
@John has a nice answer. I particularly like the discussion about fishing expeditions and how alpha-adjustment may not be necessary. I want to add one additional aspect to this discussion.

With hypothesis testing, there are two different kinds of errors to worry about: Type I and Type II (also called alpha error and beta error). Both kinds are bad, and we want to avoid both of them. When people talk about alpha-adjustment, they are focusing only on the possibility of Type I errors (that is, saying there is a difference when there isn't one). However, adjusting alpha to minimize Type I errors necessarily decreases power, and thus necessarily increases the probability of Type II errors (that is, saying there isn't a difference when in fact there is).

In addition, it's worth noting that a priori there is no reason to believe that Type I errors are worse than Type II errors (despite the fact that everyone seems to assume this must be so). Rather, which is worse varies from situation to situation and is a judgment that must be made by the researcher. In other words, when deciding on a strategy for testing multiple comparisons (e.g., an alpha-adjustment strategy), one must consider the effect of the strategy on both Type I and Type II errors and balance these effects against the severity of the errors, how much data you have, and the cost of gathering more.
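To make that trade-off concrete, here is a minimal sketch (assuming Python with statsmodels; the effect size, group size, and number of tests are hypothetical, not values from this thread) of how a Bonferroni-tightened alpha lowers power and hence raises the Type II error rate:

```python
# Minimal sketch: power of a two-sample t-test at an unadjusted vs. a
# Bonferroni-adjusted alpha. Effect size, sample size, and the number
# of tests are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size, n_per_group, m_tests = 0.5, 50, 10

for alpha in (0.05, 0.05 / m_tests):  # unadjusted, then Bonferroni
    power = analysis.power(effect_size=effect_size,
                           nobs1=n_per_group, alpha=alpha)
    print(f"alpha={alpha:.4f}  power={power:.3f}  Type II rate={1 - power:.3f}")
```

With these made-up numbers, power drops from roughly 0.70 to roughly 0.40 once alpha is divided by ten, so the Type II error rate rises substantially.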
On a different note, from your description it seems to me that your situation would best be analyzed by using a factorial ANOVA, with sex as factor 1, marital status as factor 2, language as factor 3, and age as factor 4. From the description (and I recognize that it is sparse) I don't see why a cell-means approach (i.e., one-way ANOVA) would be preferable. If you have no interest in interactions, the main effects from the factorial ANOVA are already orthogonal (at least if the $n$s are equal), and Bonferroni corrections are not relevant. Certainly the overall Type I error rate could still exceed 5%, but I'm a big believer in @John's fourth paragraph; when I'm testing theoretically suggested, a priori, orthogonal contrasts, I don't use alpha-adjustments.
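For what it's worth, here is a minimal sketch of such a main-effects-only factorial ANOVA, assuming Python with statsmodels and pandas; the column names and the synthetic data are purely hypothetical stand-ins for the survey described above:

```python
# Main-effects-only factorial ANOVA on synthetic stand-in data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "sex": rng.choice(["M", "F"], n),
    "marital": rng.choice(["single", "married"], n),
    "language": rng.choice(["en", "fr"], n),
    "age_group": rng.choice(["young", "old"], n),
})
# Toy outcome with a small built-in sex effect.
df["score"] = rng.normal(size=n) + 0.4 * (df["sex"] == "F")

# No interaction terms, per the discussion above: each main effect
# gets its own (orthogonal, for balanced n) test.
model = smf.ols("score ~ C(sex) + C(marital) + C(language) + C(age_group)",
                data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # Type II sums of squares
```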
The Bonferroni adjustment will always provide strong control of the family-wise error rate. This means that, whatever the nature and number of the tests, or the relationships between them, if their assumptions are met, it will ensure that the probability of having even one erroneous significant result among all tests is at most $\alpha$, your original error level. It is therefore always available.
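That guarantee is just Boole's inequality. If $I_0$ indexes the true null hypotheses among $m$ tests, each performed at level $\alpha/m$, then

$$\Pr\left(\bigcup_{i \in I_0} \{\text{reject } H_i\}\right) \le \sum_{i \in I_0} \Pr(\text{reject } H_i) \le |I_0| \cdot \frac{\alpha}{m} \le \alpha,$$

with no assumption of independence between the tests.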
Whether it is appropriate to use it (as opposed to another method or perhaps no adjustment at all) depends on your objectives, the standards of your discipline and the availability of better methods for your specific situation. At the very least, you should probably consider the Holm-Bonferroni method, which is just as general but less conservative.
Regarding your example, since you are performing several tests, you are increasing the family-wise error rate (the probability of rejecting at least one null hypothesis erroneously). If you only perform one test on each half, many adjustments would be possible including Hommel's method or methods controlling the false discovery rate (which is different from the family-wise error rate). If you conduct a test on the whole data set followed by several sub-tests, the tests are no longer independent so some methods are no longer appropriate. As I said before, Bonferroni is in any case always available and guaranteed to work as advertised (but also to be very conservative…).
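If it helps, here is a minimal sketch (assuming Python with statsmodels; the p-values are made up) comparing several of the adjustments mentioned above on the same inputs:

```python
# Compare adjustment methods discussed above on illustrative p-values.
from statsmodels.stats.multitest import multipletests

pvals = [0.008, 0.012, 0.025, 0.041, 0.060]  # hypothetical raw p-values

for method in ("bonferroni", "holm", "hommel", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: adjusted p = {p_adj.round(3)}, reject = {reject}")
```

Holm never rejects fewer hypotheses than Bonferroni, and Hommel and Benjamini-Hochberg are typically less conservative still (the latter controlling the false discovery rate rather than the family-wise error rate).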
You could also just ignore the whole issue. Formally, the family-wise error rate is higher, but with only two tests it's still not so bad (about $1-(1-0.05)^2 \approx 0.0975$ if they are independent). You could also start with a test on the whole data set, treated as the main outcome, followed by sub-tests for different groups, uncorrected because they are understood as secondary outcomes or ancillary hypotheses.
If you consider many demographic variables in that way (as opposed to planning from the outset to test for gender differences, or taking a more systematic modeling approach), the problem becomes more serious, with a significant risk of "data dredging": one difference comes out significant by chance, allowing you to rescue an inconclusive experiment, with some nice story about the demographic variable to boot, whereas in fact nothing really happened. In that case you should definitely consider some form of adjustment for multiple testing. The logic remains the same with $X$ different hypotheses: testing $X$ hypotheses twice (once on each half of the data set) entails a higher family-wise error rate than testing $X$ hypotheses only once, and you should probably adjust for that.
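To put numbers on that inflation: for $m$ independent tests at level $\alpha = 0.05$, the family-wise error rate is $1-(1-\alpha)^m$ (and is bounded above by $m\alpha$ in general):

```python
# Family-wise error rate for m independent tests at level alpha.
alpha = 0.05
for m in (1, 2, 5, 10, 20):
    print(f"m = {m:2d}: FWER = {1 - (1 - alpha) ** m:.3f}")
```

So doubling the number of tests, for example by testing each hypothesis on both halves of the data, roughly doubles the family-wise error rate while it is still small.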
Best Answer
I disagree strongly with @fcoppens' leap from recognizing the importance of multiple-hypothesis correction within a single investigation to claiming that "By the same reasoning, the same holds if several teams perform these tests."
There is no question that the more studies are performed and the more hypotheses are tested, the more Type I errors will occur. But I think there's a confusion here over the meaning of "family-wise error" rates and how they apply in actual scientific work.
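To quantify that first point: with $m_0$ true null hypotheses each tested with an exact level-$\alpha$ test, linearity of expectation gives

$$E[\text{number of Type I errors}] = \sum_{i \in I_0} \Pr(\text{reject } H_i) = m_0\,\alpha,$$

regardless of any dependence between the tests. The expected count grows linearly in the amount of testing, which is the sense in which more testing produces more errors.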
First, remember that multiple-testing corrections typically arose in post-hoc comparisons for which there were no pre-formulated hypotheses. It is not at all clear that the same corrections are required when there is a small pre-defined set of hypotheses.
Second, the "scientific truth" of an individual publication does not depend on the truth of each individual statement within the publication. A well-designed study approaches an overall scientific (as opposed to statistical) hypothesis from many different perspectives, and puts together different types of results to evaluate the scientific hypothesis. Each individual result may be evaluated by a statistical test.
By @fcoppens' argument, however, if even one of those individual statistical tests makes a Type I error, then that leads to a "false belief of 'scientific truth'". This is simply wrong.
The "scientific truth" of the scientific hypothesis in a publication, as opposed to the validity of an individual statistical test, generally comes from a combination of different types of evidence. Insistence on multiple types of evidence makes the validity of a scientific hypothesis robust to the individual mistakes that inevitably occur. As I look back on my 50 or so scientific publications, I would be hard pressed to find any that remains so flawless in every detail as @fcoppens seems to insist upon. Yet I am similarly hard pressed to find any where the scientific hypothesis was outright wrong. Incomplete, perhaps; made irrelevant by later developments in the field, certainly. But not "wrong" in the context of the state of scientific knowledge at the time.
Third, the argument ignores the costs of making Type II errors. A Type II error might close off entire fields of promising scientific inquiry. If the recommendations of @fcoppens were to be followed, Type II error rates would escalate massively, to the detriment of the scientific enterprise.
Finally, the recommendation is impossible to follow in practice. If I analyze a set of publicly available data, I may have no way of knowing whether anyone else has used it, or for what purpose. I have no way of correcting for anyone else's hypothesis tests. And as I argue above, I shouldn't have to.