Solved – consensus on adjusting alpha for multiple contrasts if the main effect is significant

anova · bonferroni · contrasts · multiple-comparisons · post-hoc

After talking to a couple of statisticians, reading some book sections, internet articles, and forums like this, I am still unclear about multiplicity adjustments of post hoc tests and contrasts.

Let's use an example: there are 4 groups of subjects – #1 is the control, and groups 2-4 are treated with different drugs. We're trying to find out which drug has had a significant effect compared to the control group.

Now we run an ANOVA and let's say the main effect of drug is significant. All this means is that some of the groups differ from each other; it doesn't tell us which drug worked. In essence, we need non-orthogonal contrasts that compare:

  • 1 vs 2
  • 1 vs 3
  • 1 vs 4

This will tell us which drug had any effect compared to the control subjects. This is where the problems begin. Some sources say that if you're running 3 contrasts, you need to apply a multiplicity adjustment like Bonferroni, Tukey, Sidak, etc. These adjustments lower alpha (0.05) to a stricter per-comparison level. We have 3 comparisons, so with Bonferroni a comparison needs $p<0.017$ (i.e. $0.05/3$) to be significant; a short numerical sketch of this arithmetic follows the question list below. But other sources say that if the main effect is significant, then no correction needs to be applied and you can run Fisher's LSD with $\alpha=0.05$. It has been a major headache to try and figure this out, so I put together a list of the most important questions about multiple comparisons to see if there is any consensus out there, or at least a more commonly accepted solution:

  1. If the main effect is significant, is it necessary to adjust contrasts for multiplicities or not? If the answer is "sometimes," please specify the conditions.

    ANSWER: It seems that contrasts need to be adjusted for multiplicity only if they are non-orthogonal; orthogonal contrasts appear to need no correction (DSUS, p.455). A small sketch for checking contrast orthogonality follows the question list below.

  2. Assuming that there are no conditions that prevent the usage of any particular type of correction, is it acceptable to use only the most powerful/least conservative correction? The list includes Bonferroni, Sidak, Tukey, Holm-Sidak, Holm-Bonferroni, Dunnett, etc. If this is not acceptable, please elaborate. I have read different sources state opposite arguments.

    ANSWER: Dave Howell himself says that "It is perfectly acceptable to calculate the size of the critical value under a number of different tests, and then choose the test with the smallest critical value" (Multiple Comparisons with Repeated Measures).

  3. Given the above answer, it would be helpful if we could see some general guidelines as to which multiplicity adjustment tests have more power and under what conditions. For example, it seems that Tukey is too conservative for small sample sizes and that Bonferroni is more conservative than Sidak. I have read that Holm-Sidak, Hochberg's GT2, and the Games-Howell procedure are very powerful, especially for unbalanced data with unequal variances (DSUS, p.459).

  4. Post hoc tests are essentially a group of contrasts. It seems that if a post hoc is available for the given analysis and software (such as SPSS), there is no point in running a contrast unless you're interested in combining several of the groups together, which post hoc can't do. Otherwise, it's much easier to run a post hoc instead since it automatically applies the necessary corrections. Please clarify if this understanding is correct.

    ANSWER: I ran some ANOVA simulations in SPSS and found that post hoc LSD p-values (which are not adjusted for multiplicity) are identical to contrast p-values. An unadjusted post hoc comparison must indeed be the same as a contrast. So a contrast should only be used if a post hoc cannot handle the given hypothesis. For example, a post hoc won't work if you're trying to compare only some of the groups in your data set or if your hypothesis calls for a combination of groups, such as control vs the average of 3 treatment groups. In all other cases, post hoc analyses make the calculations much easier and less cumbersome.

  5. How does the issue of multiplicity correction apply to simple main effects? This is applicable when you have multiple levels within each group (such as repeated measures) and want to find out exactly which groups differ at a given level.

  6. Does the discussion of multiplicity correction apply to mixed models just as it does to ANOVA, or are the approaches different here?
  7. It seems to be accepted that a significant ANOVA is not necessary in order to run a post hoc adjusted for multiplicity (Hsu, p.177; Motulsky). So if a particular hypothesis doesn't need ANOVA, is there a better/more efficient way to run a "post hoc" without having to run the ANOVA at all?

    ANSWER (partial?): It seems to me that since t-tests are a special case of ANOVA, we should be able to avoid running an ANOVA by running several t-tests instead. But this will be cumbersome because after the t-tests are done, their p-values will have to be adjusted manually. It won't be hard to do a Bonferroni adjustment, but something like Dunnett or Holm-Sidak is not so clear. I'm also not clear on how t-tests can be utilized if there are repeated measures. This answer needs to be expanded or corrected; a sketch of the t-test-plus-adjustment route follows the question list below.

  8. Finally, is it safe to assume that if the main effect is not significant, then unadjusted post hocs/contrasts would be out of the question?
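
Following up on question 1 (and the group-combining case in question 4), here is a minimal Python/NumPy sketch for checking whether contrast weight vectors are orthogonal. Assuming equal group sizes, two contrasts are orthogonal exactly when the dot product of their weight vectors is zero; the group order and weights below just reflect the four groups from the example, not anything from a specific software package.

```python
import numpy as np

# Contrast weights over the four groups [control, drug2, drug3, drug4].
# Any set of weights summing to zero defines a contrast.
contrasts = {
    "1 vs 2":                 np.array([1, -1, 0, 0]),
    "1 vs 3":                 np.array([1, 0, -1, 0]),
    "1 vs 4":                 np.array([1, 0, 0, -1]),
    "control vs mean(drugs)": np.array([1, -1/3, -1/3, -1/3]),
    "drug2 vs drug3":         np.array([0, 1, -1, 0]),
}

# With equal group sizes, two contrasts are orthogonal iff the dot product
# of their weight vectors is zero.
names = list(contrasts)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        dot = float(np.dot(contrasts[names[i]], contrasts[names[j]]))
        status = "orthogonal" if np.isclose(dot, 0) else "NOT orthogonal"
        print(f"({names[i]}) x ({names[j]}): dot = {dot:+.2f} -> {status}")
```

Running it shows that the three control-vs-drug contrasts are mutually non-orthogonal (each pair has dot product 1), which is precisely the situation where the sources above insist on a multiplicity adjustment.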
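
And for question 7 (plus the $0.05/3$ arithmetic mentioned earlier): a hedged sketch of skipping the ANOVA, running the three control-vs-drug t-tests directly, and adjusting the resulting p-values with multipletests from statsmodels, which implements Bonferroni, Holm, Sidak, and Holm-Sidak. The data are simulated placeholders, and this simple version does not pool the error term across all four groups the way ANOVA-based contrasts would.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

# Simulated placeholder data: control plus three drug groups.
control = rng.normal(0.0, 1.0, 20)
drugs = {"drug2": rng.normal(1.0, 1.0, 20),   # built-in true effect
         "drug3": rng.normal(0.2, 1.0, 20),
         "drug4": rng.normal(0.0, 1.0, 20)}

alpha, m = 0.05, len(drugs)
print(f"Bonferroni per-comparison threshold: {alpha / m:.4f}")  # 0.05/3
print(f"Sidak per-comparison threshold:      {1 - (1 - alpha) ** (1 / m):.4f}")

# Raw (unadjusted) two-sample t-tests of each drug against the control group.
labels, raw_p = [], []
for name, sample in drugs.items():
    labels.append(name)
    raw_p.append(stats.ttest_ind(sample, control).pvalue)

# Adjust the whole family of p-values; 'holm-sidak' is the sequential
# Holm-Sidak procedure mentioned above ('bonferroni' and 'holm' also work).
reject, adj_p, _, _ = multipletests(raw_p, alpha=alpha, method="holm-sidak")

for name, p, q, r in zip(labels, raw_p, adj_p, reject):
    print(f"{name} vs control: raw p = {p:.4f}, adjusted p = {q:.4f}, reject H0: {r}")
```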

I'm hoping for a healthy discussion, if not conclusive answers. The latter is far preferable, of course. My contention is that if there is no clear consensus among statisticians about a given topic, then end users, such as researchers, should essentially be free to use whatever suits their needs.

UPDATE 04/27/2015: The fact that no one has contributed anything yet shows how poorly multiplicity adjustments are understood, even among more advanced users of statistics. I have updated some points with interesting references/answers. More input is still needed, though.


Response to @Bonferroni's answer from Aug 3, 2016.

Thanks for the reference. Since my OP I have read more about these problems and also talked to some statisticians. In general, I think you are taking an overly restrictive approach to a problem that doesn't have a clear consensus. I don't know about Frane's credentials and he doesn't have that many publications/citations, but for an opposing opinion, please see Nakagawa, who has a solid track record in stats, including advanced techniques like mixed models. Setting aside the argument about planned vs unplanned or orthogonal vs non-orthogonal comparisons, Nakagawa talks about getting rid of multiplicity adjustments altogether and makes interesting points to that end. He's not alone in that view.

I don't know if you have a specific reference that delineates why choosing the most powerful adjustment method is problematic. Per my reference, there is no consensus on this either. I don't see an issue with going with the most powerful adjustment, like Dunnett or sequential Holm-Sidak (if CIs are not needed). Someone who knows the theory a priori will simply apply the most powerful test; those who don't will run several tests and stumble upon the most powerful one by trial and error. It's problematic to say that each adjustment is a separate test that itself requires an adjustment. For example, what if you run a regression and then discover that the residuals are too skewed and you have to run a different test after that? By that logic, this would also require a multiplicity adjustment, but I've never seen any rules for it.

Keep in mind that in the vast majority of scientific research the family-wise error rate is not adjusted. In fact, advanced statistical packages like SPSS and SAS do not even have a way to adjust it (I actually talked to IBM about this). If a given experiment had, say, 20 repeated measures, accounting for family-wise error would annihilate the power of the test in most practical experimental designs.
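
To put a rough number on that, here is a hedged sketch using statsmodels' power calculator for an ordinary two-sample t-test; the effect size and group size are assumptions of mine, and it ignores the repeated-measures structure entirely, so treat it as an order-of-magnitude illustration only.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect, n_per_group = 0.5, 30   # assumed medium effect, 30 subjects per group

unadjusted = analysis.power(effect_size=effect, nobs1=n_per_group,
                            alpha=0.05, ratio=1.0)
bonferroni_20 = analysis.power(effect_size=effect, nobs1=n_per_group,
                               alpha=0.05 / 20, ratio=1.0)

print(f"Power at alpha = 0.05:    {unadjusted:.2f}")     # unadjusted per-test power
print(f"Power at alpha = 0.05/20: {bonferroni_20:.2f}")  # far lower after Bonferroni over 20 tests
```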

I hope that there will be more contributions to this thread. Eventually I will do a major edit to the OP as I’m learning more all the time. The more stats papers I read, the clearer it becomes how much art there is in statistics.

Best Answer

  1. The idea that only non-orthogonal comparisons require adjustment is a myth. See section 6.1 of Frane (2015): http://jrp.icaap.org/index.php/jrp/article/view/514/417

  2. In general, computing several alternate statistics and picking the one that gives you the answer you like best is a bad policy and can cause error inflation (as it's a form of multiple comparisons in itself). It's best to have a statistical plan before you look at your data.

  3. Bonferroni is less powerful than Holm. Holm is less powerful than some other procedures that require more assumptions. Sidak is only a tiny bit more powerful than Bonferroni and requires the assumption of non-negative dependence. If you just want to compare each treatment to control, and not compare the different treatments to each other, you can use Dunnett's procedure, which is designed for exactly that purpose (see the short sketch after this list).

  4. Not sure what you mean by "post hoc." Unfortunately, different people use that term in different ways.

  5. Multiplicity applies any time you conduct more than one comparison.

  6. See 5.

  7. If you're not interested in the omnibus result, there's no reason to perform the omnibus test. As you observed, you can just go straight to the individual tests, adjusted for multiplicity (though it may be advisable to use the omnibus error term for those tests, which can provide more power in some cases). Some people perform the omnibus test and then use Fisher's LSD method (i.e. do the individual comparisons without adjustment), but that doesn't generally control the familywise error rate and may thus be hard to justify.

  8. I don't see why the significance of a main effect should inherently affect whether you adjust the other tests.
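
Regarding Dunnett's procedure in point 3, here is a brief sketch of running it directly in Python. scipy.stats.dunnett (available in recent SciPy versions, 1.11 or later if I recall correctly) compares each treatment group against a shared control while controlling the familywise error rate; the data below are simulated placeholders, not anything from the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated control and three drug groups (drug2 carries a real effect).
control = rng.normal(0.0, 1.0, 20)
drug2 = rng.normal(1.0, 1.0, 20)
drug3 = rng.normal(0.3, 1.0, 20)
drug4 = rng.normal(0.0, 1.0, 20)

# Dunnett's test: each treatment vs the common control, with the
# familywise error rate controlled across the three comparisons.
result = stats.dunnett(drug2, drug3, drug4, control=control)

for name, p in zip(["drug2", "drug3", "drug4"], result.pvalue):
    print(f"{name} vs control: adjusted p = {p:.4f}")
```

The p-values it returns are already adjusted, so they can be compared directly against 0.05 with no further correction.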


Response to @Sophocole's reply from Aug 5, 2016 to @Bonferroni's answer from Aug 3, 2016.

I don't know who you talked to at IBM, but SPSS has several ways to control the familywise error rate, including Bonferroni, Tukey, and Dunnett tests (just google "multiple comparisons in SPSS" and you'll see). The same goes for any other reputable statistical package, including SAS and R. And if you're using a simple method like Bonferroni, you can probably do the adjustment in your head.

Regarding doing multiple tests of a single comparison and choosing the one that gives you the answer you like best, it's pretty straightforward to see what the problem with that is. If you try one method that produces error at a rate of 5%, but then you get a second, third, and fourth chance with alternative methods, obviously the error rate is going to be bigger than 5%. That's like playing darts and setting up a second, third, and fourth bull's eye in slightly different positions on the dart board--obviously, you're increasing your chances of getting lucky.
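
If you want to see that inflation directly, here is a small Monte Carlo sketch (illustrative assumptions only): simulate two groups with no true difference, analyze each dataset with both a t-test and a Mann-Whitney test, and count a "discovery" whenever either p-value falls below 0.05. Even with two tests as closely related as these, the combined false-positive rate comes out above the nominal 5%; less similar analysis choices inflate it further.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group, alpha = 10_000, 15, 0.05
either_rejects = 0

for _ in range(n_sims):
    # Both groups come from the same distribution, so any "significant"
    # difference is a false positive.
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)

    p_t = stats.ttest_ind(a, b).pvalue
    p_mw = stats.mannwhitneyu(a, b, alternative="two-sided",
                              method="asymptotic").pvalue

    # "Pick whichever test gives the nicer answer": count a discovery
    # if either p-value clears the threshold.
    if min(p_t, p_mw) < alpha:
        either_rejects += 1

print(f"False-positive rate when taking the best of two tests: "
      f"{either_rejects / n_sims:.3f}")   # comes out above the nominal 0.05
```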

If you're in a very early stage of your research where you're just exploring around and error rates aren't a big concern, then by all means, test your heart out and don't bother with adjustments--you could even just look at the plots and mean differences and not do any formal testing at all if that suits your needs. But if you're trying to publish a claim or sell a treatment based on your results, you likely need statistical rigor. And if you're trying to get a drug approved by the FDA, you can forget about playing loose with error control!

By the way, you may want to read that Nakagawa article again. It seems he is not arguing against "getting rid of multiplicity adjustments altogether." He apparently thinks Bonferroni and Holm are generally too conservative for behavioral ecology research, but he does endorse false discovery rate control.
