Solved – Role of Central Limit Theorem in one-way ANOVA

anova, central limit theorem, normality-assumption, sample-size

Background: It has been shown and is widely referenced (applets even exist to demonstrate it) that even for a highly skewed numeric variable, a sample size of $n \ge 30$ is often "large enough" for the Central Limit Theorem (CLT) to take effect, so that the distribution of the sample mean can be treated as normal for the purposes of inference. Some sources do suggest $n = 40$ or even $50$ for very highly skewed data, however.

Query: It seems, then, that for the one-way ANOVA this suggestion should apply to each factor level (group) as well, since the data in each factor level need to be individually normally distributed (or the sample size "large enough") to meet the normality assumption of the one-way ANOVA. I'm wondering if anyone has any references that address this specifically.

I'm mostly curious because this post from Minitab (linked) says simulation studies have shown for:

  • 2-9 groups, $n \ge 15$ for each group is sufficient
  • 10-12 groups, $n \ge 20$ for each group is sufficient

The wording is somewhat vague, but the author implies these values are when the data are highly skewed (and this is when we would care anyway). No citation is given and I'm having quite the time finding other explicit discussions of the CLT in the framework of a one-way ANOVA.
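For anyone who wants to probe a claim like this directly, here is a minimal simulation sketch (not the Minitab study, whose design is not given). It assumes 3 groups drawn from the same lognormal distribution, so the null hypothesis is true, and estimates how often the usual F test rejects at the 5% level. The function names, the lognormal choice, and the default settings are all illustrative assumptions; the critical value uses the closed form of the F distribution that holds only when the numerator degrees of freedom equal 2 (i.e., exactly 3 groups).

```python
import random
import statistics


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups (lists of floats)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [statistics.fmean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def f_crit_df1_2(df2, alpha=0.05):
    """Upper-alpha critical value of F(2, df2).

    Valid only for numerator df = 2, where P(F > f) = (1 + 2f/df2) ** (-df2/2).
    """
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


def type1_rate(n_per_group, reps=4000, sigma=1.0, alpha=0.05, seed=1):
    """Estimated rejection rate when all 3 groups share one lognormal law."""
    rng = random.Random(seed)
    k = 3  # fixed at 3 groups so the closed-form critical value applies
    crit = f_crit_df1_2(k * n_per_group - k, alpha)
    rejections = 0
    for _ in range(reps):
        groups = [
            [rng.lognormvariate(0.0, sigma) for _ in range(n_per_group)]
            for _ in range(k)
        ]
        if f_statistic(groups) > crit:
            rejections += 1
    return rejections / reps
```

Calling `type1_rate(15)` gives an estimate for the "15 per group" setting with this particular skewed distribution. Note that this only checks the type I error rate; it says nothing about power, and a different distribution or number of groups could behave differently.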

I've found one source that suggests the 30 per group cut-off, but it's a text (preview given at link below) and I have no way to track down the reference they used for this statement: Biostatistics: The Bare Essentials (Norman & Streiner): "From the Central Limit Theorem (Chapter 4), the means will be normally distributed, regardless of the original distribution, especially when there are at least 30 or so observations per group."

Sources I've tried to no avail:

  • Design and Analysis of Experiments (8th Ed.), Montgomery: Doesn't discuss CLT for a one-way ANOVA
  • Design of Experiments: Statistical Principles of Research Design and Analysis (2nd Ed.), Kuehl: Doesn't discuss
  • Statistical Inference (2nd Ed.), Casella & Berger (pg. 524): "Of course, with reasonable sample sizes and populations that are not too asymmetric, we have the Central Limit Theorem (CLT) to rely on."
  • Extensive internet search: gives various suggestions, sometimes for the overall ANOVA sample size (summed across all factor levels), which doesn't make sense to me, since it doesn't seem this would alleviate the issue in any one group. No references provided by any sources

Summary: I feel that to be "safe," if any given factor level in a one-way ANOVA is notably skewed, we should have a sample size of at least 30 in that factor level to ensure the CLT has taken effect. However, I would appreciate references that confirm this.

Best Answer

That is not a correct interpretation of the CLT. The CLT is a limiting argument and only helps you with respect to type I error, not type II error. Confidence intervals using the CLT can be horrendously inaccurate for sample sizes in the thousands when the data distribution is very skewed (e.g., a lognormal distribution). If the 2-sample $t$-test doesn't work in the face of marked asymmetry, ANOVA will not fare any better. Note also that the CLT, as it is usually invoked, only applies when the population variance is known.
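The confidence interval point can be checked by simulation. The sketch below is only an illustration under stated assumptions: data are lognormal with log-scale mean 0 and log-scale standard deviation `sigma`, the interval is the usual normal-approximation one (sample mean plus or minus 1.96 standard errors), and the function name and defaults are made up for this example.

```python
import math
import random
import statistics


def z_interval_coverage(n, sigma, reps=2000, seed=0):
    """Fraction of nominal 95% CLT intervals that contain the true mean."""
    rng = random.Random(seed)
    true_mean = math.exp(sigma ** 2 / 2)  # mean of lognormal(0, sigma)
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
    hits = 0
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        m = sum(sample) / n
        # Sample standard deviation with the n - 1 denominator.
        s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        se = s / math.sqrt(n)
        if m - z * se <= true_mean <= m + z * se:
            hits += 1
    return hits / reps
```

For example, `z_interval_coverage(30, 1.5)` estimates the actual coverage at $n = 30$ for a strongly skewed lognormal. Increasing `n` and `sigma` shows how slowly the coverage approaches the nominal 95% (larger `n` makes the pure-Python loop correspondingly slower).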

Rand Wilcox has written nice papers showing that with data distributions that are only slightly non-normal the distribution of the $t$ statistic can be very far from the $t$ distribution. Note that it is not relevant that critical values of the $t$ distribution are very close to $z$ critical values for $n\geq 20$ or $30$; these both pertain to Gaussian data.
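A rough way to see this is to simulate the one-sample $t$ statistic under non-normal data and count how often it falls beyond the nominal critical values in each tail. The sketch below is not a reproduction of Wilcox's papers: it assumes lognormal data (markedly, not slightly, skewed), $n = 30$, and a hard-coded tabulated critical value of 2.045 for 29 degrees of freedom. If the $t$ distribution were a good description, each tail rate would be close to 0.025.

```python
import math
import random

T_CRIT_29 = 2.045  # tabulated 0.975 quantile of Student's t with 29 df


def t_statistics(n=30, sigma=1.0, reps=4000, seed=0):
    """Simulated one-sample t statistics for lognormal(0, sigma) data."""
    rng = random.Random(seed)
    true_mean = math.exp(sigma ** 2 / 2)  # mean of lognormal(0, sigma)
    out = []
    for _ in range(reps):
        sample = [rng.lognormvariate(0.0, sigma) for _ in range(n)]
        m = sum(sample) / n
        s = math.sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        out.append((m - true_mean) / (s / math.sqrt(n)))
    return out


def tail_rates(tstats, crit):
    """Proportion of statistics below -crit and above +crit."""
    lower = sum(t < -crit for t in tstats) / len(tstats)
    upper = sum(t > crit for t in tstats) / len(tstats)
    return lower, upper
```

With right-skewed data the two tails are far from symmetric: `tail_rates(t_statistics(), T_CRIT_29)` returns a lower-tail rate well above 0.025 and an upper-tail rate well below it, which is why the closeness of $t$ and $z$ critical values is beside the point.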

Perhaps the easiest way to understand why the CLT is irrelevant to the real world is to remember that the standard deviation is only a very good measure of dispersion if the data distribution is symmetric and not too heavy tailed, and to note that a single outlier can destroy the standard deviation estimate.
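The outlier point is easy to illustrate with a tiny made-up dataset. The numbers below are hypothetical, and the median absolute deviation (`mad`) is included only as a contrast; it is not something the answer itself proposes.

```python
import statistics


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    values = list(values)
    center = statistics.median(values)
    return statistics.median(abs(x - center) for x in values)


clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]  # hypothetical data
contaminated = clean + [50.0]                           # one added outlier

sd_clean = statistics.stdev(clean)                # about 0.2
sd_contaminated = statistics.stdev(contaminated)  # about 13.3

mad_clean = mad(clean)                # about 0.15
mad_contaminated = mad(contaminated)  # about 0.2
```

A single extreme value inflates the standard deviation by a factor of more than 60 here, while the median-based measure barely moves. Any procedure that standardizes by the sample standard deviation, including the $t$ and $F$ statistics, inherits that sensitivity.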
