ANOVA – Interpreting R^2 and F-Ratio in One-way ANOVA

Tags: anova, r-squared, variance

In my stats class, for One-Way ANOVAs, we are taught that the F ratio represents the ratio of explained to unexplained variance.

We are also taught that $R^2$ can be interpreted as the percentage of variation in the dependent variable that is explained by the independent variable.

I have an ANOVA output that has an $R^2$ value of .09 and an $F$ ratio of 2.6.
How is it the case that the model explains only 9% of the variance in the dependent variable, and yet there is more than twice as much explained variance as unexplained variance?

I know I must be missing something obvious. I am in a psychology class, so we don't go over the math behind the scenes, just the intuition and interpretation behind the analyses.

Best Answer

An ANOVA model can be stated as follows: $$y_{ij}=\mu_i+\epsilon_{ij}$$ where $y_{ij}$ is the value of the response variable in the $j$th trial for the $i$th treatment, $i=1,\dots,r$, $j=1,\dots,n_i$.
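To make the notation concrete, here is a minimal sketch that simulates data from this model in R; the treatment means $\mu_i$, the error standard deviation and the variable names are arbitrary choices for illustration.

# Simulate y_ij = mu_i + eps_ij for r = 3 treatments, n = 10 trials each
set.seed(1)
r   <- 3
n   <- 10
mu  <- c(244, 246, 248)                  # hypothetical treatment means mu_i
grp <- factor(rep(1:r, each = n))        # treatment index i
y   <- mu[as.integer(grp)] + rnorm(r * n, mean = 0, sd = 2)  # mu_i + eps_ij
head(data.frame(grp, y))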

First, you determine whether or not the treatment means are the same. The total variability of the $y_{ij}$ observations is measured in terms of the total deviation of each observation: $$y_{ij}-\overline{y}_{..}=(\overline{y}_{i.}-\overline{y}_{..})+(y_{ij}-\overline{y}_{i.})$$ where $\overline{y}_{..}$ is the overall mean, $\overline{y}_{i.}-\overline{y}_{..}$ is the deviation of the treatment means around the overall mean, and $y_{ij}-\overline{y}_{i.}$ is the deviation around the treatment means. Squaring and summing you get: \begin{align*} SSTO &= \sum_i\sum_j(y_{ij}-\overline{y}_{..})^2&\text{(total sum of squares)}\\ SSTR &= \sum_i n_i(\overline{y}_{i.}-\overline{y}_{..})^2&\text{(treatment sum of squares)} \\ SSE &= \sum_i\sum_j(y_{ij}-\overline{y}_{i.})^2&\text{(error sum of squares)}\\ SSTO&=SSTR+SSE \end{align*}

$SSTO$ has $n_T-1$ degrees of freedom, where $n_T$ is the total number of observations. $SSTR$ has $r-1$ degrees of freedom, where $r$ is the number of treatment levels. $SSE$ has $n_T-r$ degrees of freedom. The $F$ ratio is: $$F^*=\frac{MSTR}{MSE},\qquad MSTR=\frac{SSTR}{r-1},\quad MSE=\frac{SSE}{n_T-r}$$ Large values of $F^*$ support the hypothesis that not all $\mu_i$ are equal, i.e. that a significant percentage of the variation is explained by the deviation of the treatment means around the overall mean.
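As a minimal sketch, these formulas can be computed by hand in R on simulated data (three hypothetical treatments with ten observations each, the treatment variable coded as a factor so that $SSTR$ has $r-1$ degrees of freedom) and checked against aov():

# Hand-compute SSTO, SSTR, SSE and the F ratio, then check against aov()
set.seed(1)
grp <- factor(rep(1:3, each = 10))
y   <- c(244, 246, 248)[as.integer(grp)] + rnorm(30, sd = 2)

grand_mean <- mean(y)
group_mean <- tapply(y, grp, mean)       # treatment means
n_i        <- tapply(y, grp, length)     # observations per treatment

SSTO <- sum((y - grand_mean)^2)
SSTR <- sum(n_i * (group_mean - grand_mean)^2)
SSE  <- sum((y - group_mean[grp])^2)
all.equal(SSTO, SSTR + SSE)              # the decomposition holds

r   <- nlevels(grp)
n_T <- length(y)
F_star <- (SSTR / (r - 1)) / (SSE / (n_T - r))   # MSTR / MSE
F_star
summary(aov(y ~ grp))                    # reports the same F value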

If the treatment levels are quantitative independent variables, analysis of variance models make no assumption about the nature of the statistical relation between them and the response variable, but you can specify a regression function and perform a regression analysis.

In a regression analysis you are interested in a statistical relation between the independent and dependent variables, not in the difference between means. So you have: \begin{align*} SSTO&=\sum_i(y_i-\overline{y})^2&\text{(total sum of squares)} \\ SSR&=\sum_i(\hat{y}_i-\overline{y})^2&\text{(regression sum of squares)}\\ SSE&=\sum_i(y_i-\hat{y}_i)^2&\text{(residual sum of squares)}\\ SSTO&=SSR+SSE \end{align*} where $\hat{y}_i$ is the fitted value of $y_i$, i.e. the value of $y_i$ net of the error $\epsilon_i$ (the expected value of $y_i$) according to the statistical relation you have assumed. $R^2$ is defined as: $$R^2=\frac{SSR}{SSTO}=1-\frac{SSE}{SSTO}$$ Large values of $R^2$ support the hypothesis that there is a (linear) relation between the independent and dependent variables close to the one you have assumed.
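The two measures are linked. Since $SSR=R^2\,SSTO$ and $SSE=(1-R^2)\,SSTO$, the overall $F$ ratio of a regression with $p$ predictors can be written as $$F^*=\frac{SSR/p}{SSE/(n-p-1)}=\frac{R^2}{1-R^2}\cdot\frac{n-p-1}{p}$$ so $F^*$ is the ratio of explained to unexplained variation only after each is divided by its degrees of freedom. A minimal sketch in R, on simulated data with a deliberately weak linear relation (the numbers and names are arbitrary):

# R^2 from the sums of squares, and the F ratio recovered from R^2
set.seed(1)
x <- rep(1:3, each = 10)                 # hypothetical quantitative predictor
y <- 245 + 0.5 * x + rnorm(30, sd = 2)   # weak linear relation

fit  <- lm(y ~ x)
SSTO <- sum((y - mean(y))^2)
SSR  <- sum((fitted(fit) - mean(y))^2)
SSE  <- sum(residuals(fit)^2)

R2 <- SSR / SSTO                         # same as summary(fit)$r.squared
n  <- length(y)
p  <- 1
(R2 / (1 - R2)) * ((n - p - 1) / p)      # equals the overall F statistic
summary(fit)$fstatistic[1]

This is also what resolves the question: with $R^2=.09$, $R^2/(1-R^2)\approx 0.10$, so an $F$ ratio of 2.6 simply means the error degrees of freedom are roughly 26 times the treatment degrees of freedom; it does not mean that there is 2.6 times as much explained as unexplained variance.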

An example in R.

> treatment <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
> response <- c(242,245,244,248,247,248,242,244,246,242,248,246,245,247,248,250,247,246,243,244,246,248,250,252,248,250,246,248,245,250)
> dat <- data.frame(treatment, response)
> dat
   treatment response
1          1      242
2          1      245
3          1      244
4          1      248
5          1      247
6          1      248
7          1      242
8          1      244
9          1      246
10         1      242
11         2      248
12         2      246
13         2      245
14         2      247
15         2      248
16         2      250
17         2      247
18         2      246
19         2      243
20         2      244
21         3      246
22         3      248
23         3      250
24         3      252
25         3      248
26         3      250
27         3      246
28         3      248
29         3      245
30         3      250

The $F$ ratio:

> summary(aov(response ~ treatment, data=dat))
            Df Sum Sq Mean Sq F value Pr(>F)   
treatment    1  61.25   61.25   12.78 0.0013 **
Residuals   28 134.25    4.79                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

supports the hypothesis that the treatment means:

> aggregate(dat, list(dat$treatment), mean)
  Group.1 treatment response
1       1         1    244.8
2       2         2    246.4
3       3         3    248.3

are different. But $R^2$ is small:

> summary(lm(response ~ treatment, data=dat))

Call:
lm(formula = response ~ treatment, data = dat)

Residuals:
   Min     1Q Median     3Q    Max 
-3.500 -2.062 -0.250  1.688  3.750 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 243.0000     1.0577 229.742   <2e-16 ***
treatment     1.7500     0.4896   3.574   0.0013 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.19 on 28 degrees of freedom
Multiple R-squared:  0.3133,    Adjusted R-squared:  0.2888 
F-statistic: 12.77 on 1 and 28 DF,  p-value: 0.001299

Indeed, there is a weak linear relation between treatment levels and response:

[Scatter plot of response against treatment level, with the fitted regression line]
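A plot along these lines can be reproduced with base R from the dat data frame defined above (a sketch; the original figure may have looked slightly different):

# Scatter plot of response against treatment level, with the fitted line
plot(response ~ treatment, data = dat,
     xlab = "treatment level", ylab = "response")
abline(lm(response ~ treatment, data = dat))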

Let me suggest Kutner, Nachtsheim, Neter, and Li, Applied Linear Statistical Models. It is a very approachable book and can also be used as a reference. Don't be frightened by the page count :)