ANOVA Assumptions – Importance of Equality of Variance and Normality of Residuals

anova · assumptions · hypothesis-testing

When running an ANOVA we are told that certain assumptions of the test must hold for it to be applicable to the data. I have never understood why the following assumptions are necessary for the test to function:

  1. The variance of your dependent variable (residuals) should be equal in each cell of the design

  2. Your dependent variable (residuals) should be approximately normally distributed for each cell of the design

I understand that there is a bit of a grey area as to whether these assumptions need to be met, but for the sake of argument, if these assumptions were clearly violated in a given data set, what would be the issue with using an ANOVA?

Best Answer

The assumptions matter insofar as they affect the properties of the hypothesis tests (and intervals) you might use, whose distributional properties under the null are calculated relying on those assumptions.

In particular, for hypothesis tests, the things we might care about are how far the true significance level might be from what we want it to be, and whether power against alternatives of interest is good.

In relation to the assumptions you ask about:

1. Equality of variance

The variance of your dependent variable (residuals) should be equal in each cell of the design

This can certainly impact the significance level, at least when sample sizes are unequal.

(Edit:) An ANOVA F-statistic is the ratio of two estimates of variance (the partitioning and comparison of variances is why it's called analysis of variance). The denominator is an estimate of the supposedly-common-to-all-cells error variance (calculated from the residuals), while the numerator, based on the variation in the group means, has two components: one from variation in the population means and one due to the error variance. If the null is true, the two variances being estimated are the same (both are estimates of the common error variance); this common but unknown value cancels out (because we took a ratio), leaving an F-statistic that depends only on the distributions of the errors (which, under the assumptions, we can show has an F distribution). (Similar comments apply to the two-sample t-test used in the simulation below.)
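To make that concrete for a one-way layout with $k$ groups of sizes $n_i$ (total $N$), group means $\bar y_{i\cdot}$ and grand mean $\bar y_{\cdot\cdot}$ (the notation is mine, but this is the standard decomposition):

$$F = \frac{\mathrm{MS}_{\text{between}}}{\mathrm{MS}_{\text{within}}}
  = \frac{\frac{1}{k-1}\sum_{i=1}^{k} n_i\,(\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2}
         {\frac{1}{N-k}\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i\cdot})^2}\,,$$

with $E(\mathrm{MS}_{\text{within}}) = \sigma^2$ and $E(\mathrm{MS}_{\text{between}}) = \sigma^2 + \frac{1}{k-1}\sum_i n_i(\mu_i - \bar\mu)^2$, where $\bar\mu = \sum_i n_i \mu_i / N$. Under the null the second term vanishes, so both mean squares estimate the same $\sigma^2$ and it cancels in the ratio.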

[There's a little more detail on some of this in my answer here.]

However, when the two population variances differ across the two differently-sized samples, consider the denominator (of the F-statistic in ANOVA, and of the t-statistic in a t-test): it is now composed of two different variance estimates, not one, so it will not have the "right" distribution (a scaled chi-square for the F, and its square root in the case of a t; both the shape and the scale are issues).

As a result, the F-statistic or t-statistic will no longer have the F- or t-distribution, and the manner in which it is affected differs depending on whether the larger or the smaller sample was drawn from the population with the larger variance. This in turn affects the distribution of p-values.

Under the null (i.e. when the population means are equal), the p-values should be uniformly distributed. However, if the variances and the sample sizes are unequal but the means are equal (so we don't want to reject the null), the p-values are not uniformly distributed. I did a small simulation to show you what happens (a code sketch of the setup follows below). In this case I used only 2 groups, so the ANOVA is equivalent to a two-sample t-test with the equal-variance assumption. I simulated samples from two normal distributions, one with a standard deviation ten times as large as the other's, but with equal means.
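The answer doesn't include its simulation code; here is a minimal Python sketch of the same setup (normal data, equal means, one standard deviation ten times the other, 10000 replications), using scipy's pooled two-sample t-test, which is equivalent to a two-group equal-variance ANOVA:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMS = 10_000

def simulate_pvalues(n1, sd1, n2, sd2):
    """p-values from the pooled (equal-variance) two-sample t-test
    when both population means are 0, i.e. the null is true."""
    pvals = np.empty(N_SIMS)
    for i in range(N_SIMS):
        x = rng.normal(0.0, sd1, n1)
        y = rng.normal(0.0, sd2, n2)
        # equal_var=True gives the classical pooled t-test (2-group ANOVA)
        pvals[i] = stats.ttest_ind(x, y, equal_var=True).pvalue
    return pvals

# left-hand panel: larger SD goes with the smaller sample
p_left = simulate_pvalues(n1=5, sd1=10.0, n2=30, sd2=1.0)
# right-hand panel: larger SD goes with the larger sample
p_right = simulate_pvalues(n1=5, sd1=1.0, n2=30, sd2=10.0)

print("rejection rate at alpha=0.05, left :", (p_left < 0.05).mean())
print("rejection rate at alpha=0.05, right:", (p_right < 0.05).mean())
```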

For the left-hand plot, the larger (population) standard deviation went with n=5 and the smaller with n=30. For the right-hand plot, the larger standard deviation went with n=30 and the smaller with n=5. I simulated each case 10000 times and recorded the p-value each time. In each case you want the histogram to be completely flat (rectangular), since this means all tests conducted at some significance level $\alpha$ will actually attain that type I error rate. In particular it's most important that the leftmost parts of the histogram stay close to the grey line:

[Figure: histograms of p-values for the two simulated cases]

As we see, in the left-hand plot (larger variance in the smaller sample) the p-values tend to be very small -- we would reject the null hypothesis very often (nearly half the time in this example) even though the null is true. That is, our significance level is much larger than we asked for. In the right-hand plot we see the p-values are mostly large (so our significance level is much smaller than we asked for) -- in fact not once in ten thousand simulations did we reject at the 5% level (the smallest p-value here was 0.055). [This may not sound like such a bad thing, until we remember that we will also have very low power to go with our very low significance level.]

That's quite a consequence. This is why it's a good idea to use a Welch-Satterthwaite-type t-test or ANOVA when we don't have a good reason to assume the variances will be close to equal -- by comparison the Welch test is barely affected in these situations (I simulated this case as well; the two distributions of simulated p-values, which I have not shown here, came out quite close to flat).
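As an illustration in Python (this isn't the code behind the plots above), switching from the pooled test to the Welch version is a one-argument change in scipy, and rerunning the simulation sketch above with `equal_var=False` gives close-to-flat p-value histograms in both cases:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 10.0, 5)   # small sample, large SD
y = rng.normal(0.0, 1.0, 30)   # large sample, small SD

pooled = stats.ttest_ind(x, y, equal_var=True)   # assumes equal variances
welch  = stats.ttest_ind(x, y, equal_var=False)  # Welch-Satterthwaite
print("pooled p =", pooled.pvalue, " Welch p =", welch.pvalue)
```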

2. Conditional distribution of the response (DV)

Your dependent variable (residuals) should be approximately normally distributed for each cell of the design

This is somewhat less directly critical: for moderate deviations from normality, the significance level is not much affected in larger samples (though the power can be!).

Here's one example, in which the values are exponentially distributed (with identical distributions and sample sizes in each group); we can see that the significance-level issue is substantial at small $n$ but diminishes at large $n$.

[Figure: histograms of p-values when H0 is true but the distributions in each group are exponential]

We see that at n=5 there are substantially too few small p-values (the significance level for a 5% test would be about half what it should be), but at n=50 the problem is reduced -- for a 5% test in this case the true significance level is about 4.5%.
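A minimal Python sketch of this exponential example (the answer doesn't state the number of groups; k=3 here is an arbitrary choice, so the exact rejection rates will differ from the figures above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N_SIMS = 10_000

def true_level(n, k=3, alpha=0.05):
    """Rejection rate of the one-way ANOVA when all k groups are
    iid standard exponential, so the null of equal means is true."""
    rejections = 0
    for _ in range(N_SIMS):
        groups = [rng.exponential(1.0, n) for _ in range(k)]
        if stats.f_oneway(*groups).pvalue < alpha:
            rejections += 1
    return rejections / N_SIMS

print("true level at n=5 :", true_level(5))    # compare with nominal 0.05
print("true level at n=50:", true_level(50))   # compare with nominal 0.05
```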

So we might be tempted to say "well, that's fine; if n is big enough, the significance level gets pretty close", but we may also be throwing away a good deal of power. In particular, it's known that the asymptotic relative efficiency of the t-test relative to widely used alternatives can go to 0. This means that better test choices can attain the same power with a vanishingly small fraction of the sample size the t-test requires. Nothing out of the ordinary needs to be going on for the t-test to need more than, say, twice as much data as an alternative test to achieve the same power: moderately heavier-than-normal tails in the population distribution and moderately large samples can be enough to do it.
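To see the power cost directly, here is a sketch of my own (not from the answer) comparing the pooled t-test against the Wilcoxon-Mann-Whitney test on heavier-tailed data; the $t_3$ distribution, sample size and shift are all arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
N_SIMS = 10_000
n, shift = 30, 0.8          # arbitrary sample size and true location shift

t_rej = w_rej = 0
for _ in range(N_SIMS):
    # heavier-than-normal tails: Student's t with 3 degrees of freedom
    x = rng.standard_t(3, n)
    y = rng.standard_t(3, n) + shift
    if stats.ttest_ind(x, y, equal_var=True).pvalue < 0.05:
        t_rej += 1
    if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < 0.05:
        w_rej += 1

print("t-test power  :", t_rej / N_SIMS)
print("Wilcoxon power:", w_rej / N_SIMS)
```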

(Other choices of distribution may make the significance level higher than it should be, or substantially lower than we saw here.)
