Solved – the worst that can happen when the homoscedasticity assumption is violated in ANOVA

Tags: anova, assumptions, heteroscedasticity

This is a follow-up question I have after reviewing this post: Difference in means statistical test for non-normal, heteroscedastic data?

To be clear, I am asking from a pragmatic perspective (which is not to suggest that theoretical responses are unwelcome). When the groups are normally distributed (unlike in the question referenced above) but the group variances are substantively different, what is the worst that a researcher might observe?

In my experience, the issue that arises most often in this scenario is "strange" patterns in the post hoc comparisons. (I have observed this both in my published work and in pedagogic settings; happy to provide details in the comments below.) What I have observed is something akin to this: you have three groups with $M_1 < M_2 < M_3$. The (omnibus) ANOVA gives $p<\alpha$, and the pairwise $t$-tests suggest $M_2$ is statistically significantly different from the other two groups, but $M_1$ and $M_3$ are not statistically significantly different from each other. Part of my question is whether this is what others have observed, but also: what other issues have you observed in comparable scenarios?

A quick review of my reference texts suggests that ANOVA is rather robust to mild-to-moderate violations of the homoscedasticity assumption, and even more so with large sample sizes. However, these references do not specifically state (1) what could go wrong or (2) what might happen with a large number of groups.

Best Answer

Group comparisons of means based on the general linear model are often said to be generally robust to violations of the homogeneity of variance assumption. There are, however, conditions under which this is definitely not the case, and a relatively simple one is the combination of unequal variances and unequal group sizes. This combination can inflate your Type I error rate (when the smaller groups have the larger variances) or deflate it and increase your Type II error rate (when the larger groups have the larger variances).

A series of simple simulations of $p$-values will show you how. First, let's look at what a distribution of $p$-values should look like when the null is true, the homogeneity of variance assumption is met, and group sizes are equal. We will simulate standard-normal scores for 200 observations in each of two groups (x and y), run a pooled-variance $t$-test, and save the resulting $p$-value (and repeat this 10,000 times). We will then plot a histogram of the simulated $p$-values:

nSims <- 10000          # number of simulated experiments
h0 <- numeric(nSims)    # storage for the p-values

for (i in 1:nSims) {
  x <- rnorm(n = 200, mean = 0, sd = 1)  # group x: n = 200, SD = 1
  y <- rnorm(n = 200, mean = 0, sd = 1)  # group y: n = 200, SD = 1
  z <- t.test(x, y, var.equal = TRUE)    # pooled-variance t-test
  h0[i] <- z$p.value
}

hist(h0, main = "Histogram of p-values [H0 = T, HoV = T, Cell.Eq = T]",
     xlab = "Observed p-value", breaks = 100)

[Figure: histogram of simulated p-values, approximately uniform]

The distribution of $p$-values is relatively uniform, as it should be. But what if we make group y's standard deviation 5 times as large as group x's (i.e., homogeneity of variance is violated)?
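The only change needed to the loop above is the `sd` argument for group y. A minimal sketch (the seed is arbitrary, added here only for reproducibility):

```r
nSims <- 10000
h1 <- numeric(nSims)

set.seed(1)  # arbitrary seed, for reproducibility
for (i in 1:nSims) {
  x <- rnorm(n = 200, mean = 0, sd = 1)  # group x: SD = 1
  y <- rnorm(n = 200, mean = 0, sd = 5)  # group y: SD = 5 (HoV violated)
  h1[i] <- t.test(x, y, var.equal = TRUE)$p.value
}

hist(h1, main = "Histogram of p-values [H0 = T, HoV = F, Cell.Eq = T]",
     xlab = "Observed p-value", breaks = 100)
mean(h1 < .05)  # empirical Type I error rate; stays close to the nominal .05
```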

[Figure: histogram of simulated p-values, still approximately uniform]

Still pretty uniform. But when we combine the violated homogeneity of variance assumption with disparities in group size, we run into major problems. The dangerous direction for Type I error is when the smaller sample comes from the higher-variance population, so let's now also decrease the sample size of group y (the group with SD = 5) to 20.

[Figure: histogram of simulated p-values, piled up near zero]
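The key condition for inflation is that the smaller sample comes from the higher-variance population; when the larger sample has the higher variance, the pooled test instead becomes conservative. A sketch with the SD-5 group reduced to 20 observations (seed again arbitrary):

```r
nSims <- 10000
h2 <- numeric(nSims)

set.seed(2)  # arbitrary seed, for reproducibility
for (i in 1:nSims) {
  x <- rnorm(n = 200, mean = 0, sd = 1)  # group x: large n, small SD
  y <- rnorm(n = 20,  mean = 0, sd = 5)  # group y: small n, large SD
  h2[i] <- t.test(x, y, var.equal = TRUE)$p.value
}

hist(h2, main = "Histogram of p-values [H0 = T, HoV = F, Cell.Eq = F]",
     xlab = "Observed p-value", breaks = 100)
mean(h2 < .05)  # empirical Type I error rate, far above the nominal .05
```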

The combination of a larger standard deviation and a smaller sample size in the same group produces a rather dramatic inflation in our Type I error rate. But disparities in variances and sample sizes can work the other way too. If, instead, we specify a population where the null is false (group x's mean is .4 instead of 0, with group x at n = 20 and group y at n = 200), and group y has both the larger standard deviation and the larger sample size, then the pooled test becomes conservative and we can actually hurt our power to detect a real effect:
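A sketch of that power simulation, keeping group y at SD = 5 and n = 200 while group x has the true effect (seed arbitrary):

```r
nSims <- 10000
h3 <- numeric(nSims)

set.seed(3)  # arbitrary seed, for reproducibility
for (i in 1:nSims) {
  x <- rnorm(n = 20,  mean = .4, sd = 1)  # group x: small n, true mean .4
  y <- rnorm(n = 200, mean = 0,  sd = 5)  # group y: large n, large SD
  h3[i] <- t.test(x, y, var.equal = TRUE)$p.value
}

hist(h3, main = "Histogram of p-values [H0 = F, HoV = F, Cell.Eq = F]",
     xlab = "Observed p-value", breaks = 100)
mean(h3 < .05)  # empirical power; the pooled SE is badly overestimated
```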

[Figure: histogram of simulated p-values, few below .05; power is badly hurt]

So in summary, violating homogeneity of variance isn't a huge problem when group sizes are relatively equal, but when group sizes are unequal (as they might be in many areas of quasi-experimental research), heterogeneity of variance can seriously inflate your Type I error rate (when the smaller groups have the larger variances) or deflate it and cost you power (when the larger groups have the larger variances).
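As a practical aside, Welch's unequal-variances test, which is R's default (`t.test` with `var.equal = FALSE`), does not pool the variances, and it holds the Type I error rate near the nominal level even when one small group has a much larger variance. A quick check under that null configuration (seed arbitrary):

```r
nSims <- 10000
hw <- numeric(nSims)

set.seed(4)  # arbitrary seed, for reproducibility
for (i in 1:nSims) {
  x <- rnorm(n = 200, mean = 0, sd = 1)  # large n, small SD
  y <- rnorm(n = 20,  mean = 0, sd = 5)  # small n, large SD
  hw[i] <- t.test(x, y)$p.value          # Welch: var.equal defaults to FALSE
}

mean(hw < .05)  # stays close to the nominal .05
```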
