Hypothesis Testing – Relationship Between ANOVA for Comparing Means and Nested Models

anova, f-test, hypothesis-testing, model-comparison, nested-models

I've so far seen ANOVA used in two ways:

First, in my introductory statistics text, ANOVA was introduced as a way to compare the means of three or more groups, as an improvement over pairwise comparisons, in order to determine whether at least one of the means differs from the others in a statistically significant way.

Second, in my statistical learning text, I've seen ANOVA used to compare two (or more) nested models in order to determine whether Model 1, which uses a subset of Model 2's predictors, fits the data as well as Model 2, or whether the full Model 2 is superior.

Now I assume that in some way or another these two things are actually very similar because they both use an ANOVA test, but on the surface they seem quite different to me. For one, the first use compares three or more groups, while the second can be used to compare just two models. Would someone mind elucidating the connection between these two uses?

Best Answer

In my understanding, the abstract intuition behind ANOVA is the following: one decomposes the variance of the observed variable along various directions and investigates the respective contributions. To be more precise, one decomposes the identity map into a sum of orthogonal projections and asks which projections/directions make an important contribution to explaining the variance and which do not. The theoretical basis is Cochran's theorem.
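One concrete version of this decomposition (in notation I introduce here purely for illustration): let $P_1$ be the orthogonal projection onto the span of a first block of regressors, $P$ the projection onto the span of all regressors (so $P P_1 = P_1$ in the nested case), and $I$ the identity. Then $$ I = P_1 + (P - P_1) + (I - P), \qquad \lVert y \rVert^2 = \lVert P_1 y \rVert^2 + \lVert (P - P_1) y \rVert^2 + \lVert (I - P) y \rVert^2, $$ and Cochran's theorem says that, under normal errors and the relevant null hypothesis, the corresponding quadratic forms are independent and $\chi^2$-distributed after scaling by $\sigma^2$, which is exactly what justifies forming F-ratios from them.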

To be less abstract, I cast the second form mentioned by the OP into the framework just described. Subsequently, I interpret the first form as a special case of the second one.

Let us consider a regression model with $K$ explanatory variables (the full model) and compare it to the restricted model with $K-J$ variables. WLOG, the last $J$ variables of the full model are not included in the restricted model. The question answered by ANOVA is

"Can we explain significantly more variance in the observed variable if we include $J$ additional variables"?

This question is answered by comparing the variance contributions of the first $K-J$ variables, the additional $J$ variables, and the remainder/unexplained part (the residual sum of squares). This decomposition (obtained, e.g., from Cochran's theorem) is used to construct the F-test: one analyses the reduction in the residual sum of squares of the restricted model (corresponding to $H_0:$ all coefficients pertaining to the last $J$ variables are zero) that is achieved by including the additional variables, and obtains the F-statistic $$ \frac{ (RSS_{restr} - RSS_{full})/J }{ RSS_{full}/(N-K) }, $$ where $N$ is the sample size and $K$ counts all regressors of the full model, including the intercept if there is one. If the value is large enough, then the variance explained by the additional $J$ variables is significant.
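As a quick numerical illustration, here is a minimal Python sketch (the simulated data, dimensions, and variable names are my own choices, not anything from the OP's texts) that computes this F-statistic directly from the two least-squares fits:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, K, J = 100, 5, 2  # K columns in the full design (incl. intercept), J of them tested

# Full design: intercept plus K-1 simulated regressors
X_full = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
X_restr = X_full[:, :K - J]                  # restricted model drops the last J columns

beta = np.array([1.0, 0.5, -0.3, 0.1, 0.0])  # true coefficients; the last J are tested
y = X_full @ beta + rng.normal(size=N)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

rss_restr, rss_full = rss(X_restr, y), rss(X_full, y)
F = ((rss_restr - rss_full) / J) / (rss_full / (N - K))
p = stats.f.sf(F, J, N - K)                  # upper tail of the F(J, N-K) distribution
print(F, p)
```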

Now, the first form mentioned by the OP can be interpreted as a special case of the second. Consider three different groups A, B, and C with means $\mu_A$, $\mu_B$, and $\mu_C$. The hypothesis $H_0: \mu_A = \mu_B = \mu_C$ is tested by comparing the variance explained by a regression on an intercept alone (the restricted model) with the variance explained by the full model containing an intercept, a dummy for group A, and a dummy for group B. The resulting F-statistic $$ \frac{ (RSS_{intercept} - RSS_{dummies})/2 }{ RSS_{dummies}/(N-3) } $$ is equivalent to the classical one-way ANOVA F-test described on Wikipedia. The denominator equals the variation within the groups, and the numerator equals the variation between the groups. If the variation between the groups is sufficiently large relative to the variation within the groups, one rejects the hypothesis that all means are equal.
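This equivalence is easy to check numerically. The following sketch (again with simulated data, purely for illustration) compares the dummy-variable regression F-statistic with the one returned by scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 30)    # group A
b = rng.normal(0.5, 1.0, 30)    # group B
c = rng.normal(1.0, 1.0, 30)    # group C
y = np.concatenate([a, b, c])
N = y.size

# Full model: intercept plus dummies for groups A and B (C is the baseline)
d_a = np.r_[np.ones(30), np.zeros(60)]
d_b = np.r_[np.zeros(30), np.ones(30), np.zeros(30)]
X_dummies = np.column_stack([np.ones(N), d_a, d_b])
X_intercept = np.ones((N, 1))   # restricted model: intercept only

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return resid @ resid

F_reg = ((rss(X_intercept, y) - rss(X_dummies, y)) / 2) / (rss(X_dummies, y) / (N - 3))
F_anova, p_anova = stats.f_oneway(a, b, c)
print(F_reg, F_anova)           # the two F-statistics agree (up to rounding)
```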
