If I have 3 groups that I want to compare with a t-test (correcting for multiple comparisons afterwards) but one of the groups has different variance than the other two, should I use a Welch correction for all comparisons or just for the comparison involving the group with unequal variance?
Solved – Multiple comparisons with t-test: one group with different variance
multiple-comparisons, t-test, variance
Related Solutions
One of the issues of inference that arises in event history models is that hazard functions and survival functions in different groups can cross each other at different points in time. For example, both of the following conditions can be true:
- Those individuals in group A who experience the event (i.e. who "do not survive") may do so relatively quickly, while individuals in group B who experience the event take longer to do so.
- The overall survival in group A may be higher than in group B.
So when you ask about wanting to make comparisons among groups, what specifically do you want to compare? Median survival time? The hazard at time t? The survival at time t? The time until survival "flattens" (for some meaning of "flatten")? Something else?
Once you have a well-formulated question about what you would like to compare, multiple comparisons adjustments make sense. Some cases (comparisons at each point in time t, for example) might make the definition of family in the FWER multiple comparison adjustment methods problematic, which might incline one towards the FDR methods, since they scale better and do not rely on a definition of family.
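As a generic illustration (the p-values below are made up, not from any particular set of survival comparisons), both kinds of adjustment are available in base R through p.adjust:

```r
# Illustrative p-values only; substitute those from your chosen comparisons
p <- c(0.001, 0.008, 0.012, 0.049, 0.20)

p.adjust(p, method = "holm")  # FWER control (Holm step-down)
p.adjust(p, method = "BH")    # FDR control (Benjamini-Hochberg)
```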
Nice question! Let's first clear up some potential confusion. Dunn's test (Dunn, 1964) is precisely that: a test statistic that is a nonparametric analog to the pairwise t test one would conduct post hoc to an ANOVA. It is similar to the Mann-Whitney-Wilcoxon rank sum test, except that (1) it employs a measure of the pooled variance implied by the null hypothesis of the Kruskal-Wallis test, and (2) it uses the same rankings of one's original data as are used by the Kruskal-Wallis test.
Dunn also developed what is commonly referred to as the Bonferroni adjustment for multiple comparisons (Dunn, 1961), which is one of many methods to control the family-wise error rate (FWER) that have since been developed, and which simply entails dividing $\alpha$ (one-tailed tests) or $\alpha/2$ (two-tailed tests) by the number of pairwise comparisons one is making. The maximum number of pairwise comparisons one may make with $k$ variables is $k(k-1)/2$, so with $k = 17$ that is $17 \times 16/2 = 136$ possible pairwise comparisons, implying that you could only reject a null hypothesis for any single test if $p \le \alpha/2/136$. Your concern about power is therefore warranted for this method.
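To make that arithmetic concrete, a quick check in R:

```r
k <- 17
choose(k, 2)             # k(k-1)/2 = 136 pairwise comparisons
0.05 / 2 / choose(k, 2)  # per-comparison two-tailed threshold, about 0.00018
```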
However, other methods to control the FWER exist that retain more statistical power. For example, the Holm and Holm-Sidak stepwise methods (Holm, 1979) do not hemorrhage power the way the Bonferroni method does. Alternatively, you could aim to control the false discovery rate (FDR) instead; the Benjamini-Hochberg (1995) and Benjamini-Yekutieli (2001) methods generally give more statistical power by assuming that some null hypotheses are false (i.e., by building the idea that not all rejections are false rejections into sequentially modified rejection criteria). These and other multiple comparisons adjustments are implemented specifically for Dunn's test in Stata in the dunntest package (within Stata type net describe dunntest, from(https://alexisdinno.com/stata)), and in R in the dunn.test package.
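For instance, a minimal sketch in R (the data here are simulated placeholders, and "bh" is one of the package's documented adjustment options):

```r
library(dunn.test)  # install.packages("dunn.test") if needed

set.seed(1)
x <- c(rnorm(20, 0), rnorm(20, 0.5), rnorm(20, 1))
g <- rep(c("A", "B", "C"), each = 20)

# Kruskal-Wallis test followed by Dunn's pairwise comparisons,
# adjusted here with the Benjamini-Hochberg FDR method
dunn.test(x, g, method = "bh")
```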
In addition, there is an alternative to Dunn's test (which is based on an approximate z test statistic): the Conover-Iman test, which is valid exclusively post hoc to a rejected Kruskal-Wallis test, is based on a t distribution and is more powerful than Dunn's test (Conover & Iman, 1979; Conover, 1999). One can also use the methods to control the FWER or the FDR with the Conover-Iman tests, which are implemented for Stata in the conovertest package (within Stata type net describe conovertest, from(https://alexisdinno.com/stata)), and for R in the conover.test package.
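The R interface mirrors dunn.test; a sketch reusing the simulated x and g from above (again, "bh" is one of the package's adjustment options):

```r
library(conover.test)  # install.packages("conover.test") if needed

# Conover-Iman pairwise comparisons; only to be interpreted
# after the Kruskal-Wallis test itself rejects
conover.test(x, g, method = "bh")
```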
References
Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188.
Conover, W. J. (1999). Practical Nonparametric Statistics. Wiley, Hoboken, NJ, 3rd edition.
Conover, W. J. and Iman, R. L. (1979). On multiple-comparisons procedures. Technical Report LA-7677-MS, Los Alamos Scientific Laboratory.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52–64.
Dunn, O. J. (1964). Multiple comparisons using rank sums. Technometrics, 6(3):241–252.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.
Best Answer
You don't say how you plan to correct for multiple comparisons, but the implication ("afterwards") is that you will use a method based on p-values, such as a False Discovery Rate procedure. If that is the case, then it does not matter which t-test you use, as long as you are comfortable with the assumptions going into each test. So you can run the Welch test on all of the comparisons to be consistent, or, if you prefer, use the pooled test on the two groups that you believe to have the same variance. You could even combine a p-value from an ANOVA, another from a regression, and one from a test of proportions.
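As one concrete illustration, a minimal sketch in R with simulated data (the group sizes, means, and standard deviations are placeholders, not from the question); t.test performs the Welch test by default, and var.equal = TRUE gives the pooled test:

```r
set.seed(42)
y1 <- rnorm(20, mean = 0, sd = 1)
y2 <- rnorm(20, mean = 0, sd = 1)
y3 <- rnorm(20, mean = 0, sd = 3)  # the group with a different variance

# Welch test on all three comparisons (t.test's default)
p <- c(t.test(y1, y2)$p.value,
       t.test(y1, y3)$p.value,
       t.test(y2, y3)$p.value)

# Correct for multiple comparisons afterwards, e.g. with the FDR
p.adjust(p, method = "BH")
```

Swapping in t.test(y1, y2, var.equal = TRUE) for the equal-variance pair gives the mixed approach.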
Why not try it both ways and see how they compare? If all the assumptions hold, you should not see much of a difference.
You could further prove this to yourself (or find a counterexample, if it is not true) by simulating the process, running it many times both ways, and seeing how the two methods compare, as in the sketch below.
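A sketch of such a simulation under the null hypothesis of no mean differences (all parameter values are arbitrary placeholders):

```r
# Compare "Welch everywhere" vs "pooled where variances match",
# each followed by a BH adjustment, under the complete null
set.seed(7)
nsim <- 10000
reject_welch <- reject_mixed <- logical(nsim)

for (i in seq_len(nsim)) {
  y1 <- rnorm(20, 0, 1); y2 <- rnorm(20, 0, 1); y3 <- rnorm(20, 0, 3)
  welch <- c(t.test(y1, y2)$p.value,
             t.test(y1, y3)$p.value,
             t.test(y2, y3)$p.value)
  mixed <- c(t.test(y1, y2, var.equal = TRUE)$p.value,  # pooled for the equal-variance pair
             t.test(y1, y3)$p.value,                    # Welch where variances differ
             t.test(y2, y3)$p.value)
  reject_welch[i] <- any(p.adjust(welch, method = "BH") <= 0.05)
  reject_mixed[i] <- any(p.adjust(mixed, method = "BH") <= 0.05)
}

mean(reject_welch)  # proportion of runs with any rejection, Welch everywhere
mean(reject_mixed)  # same, for the mixed approach
```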