Solved – ANOVA vs multiple linear regression? Why is ANOVA so commonly used in experimental studies

anovaleast squaresmultiple regression

ANOVA vs multiple linear regression?

I understand that both of these methods seem to use the same statistical model. However under what circumstances should I use which method?

What are the advantages and disadvantages of these methods when compared?

Why is ANOVA so commonly used in experimental studies and I hardly ever find a regression study?

Best Answer

It would be interesting to appreciate that the divergence is in the type of variables, and more notably the types of explanatory variables. In the typical ANOVA we have a categorical variable with different groups, and we attempt to determine whether the measurement of a continuous variable differs between groups. On the other hand, OLS tends to be perceived as primarily an attempt at assessing the relationship between a continuous regressand or response variable and one or multiple regressors or explanatory variables. In this sense regression can be viewed as a different technique, lending itself to predicting values based on a regression line.

However, this difference does not stand the extension of ANOVA to the rest of the analysis of variance alphabet soup (ANCOVA, MANOVA, MANCOVA); or the inclusion of dummy-coded variables in the OLS regression. I'm unclear about the specific historical landmarks, but it is as if both techniques have grown parallel adaptations to tackle increasingly complex models.

For example, we can see that the differences between ANCOVA versus OLS with dummy (or categorical) variables (in both cases with interactions) are cosmetic at most. Please excuse my departure from the confines in the title of your question, regarding multiple linear regression.

In both cases, the model is essentially identical to the point that in R the lm function is used to carry out ANCOVA. However, it can be presented as different with regards to the inclusion of an intercept corresponding to the first level (or group) of the factor (or categorical) variable in the regression model.

In a balanced model (equally sized $i$ groups, $n_{1,2,\cdots\, i}$) and just one covariate (to simplify the matrix presentation), the model matrix in ANCOVA can be encountered as some variation of:

$$X=\begin{bmatrix} 1_{n_1} & 0 & 0 & x_{n_1} & 0 & 0\\ 0 & 1_{n_2} & 0 & 0 & x_{n_2} & 0\\ 0 & 0 & 1_{n_3} & 0 & 0 & x_{n_3} \end{bmatrix}$$

for $3$ groups of the factor variable, expressed as block matrices.

This corresponds to linear model:

$$y = \alpha_i + \beta_1\, x_{n_1}+ \beta_2\,x_{n_2} \,+ \beta_3\,x_{n_3}\,+ \epsilon_i$$ with $\alpha_i$ equivalent to the different group means in an ANOVA model, while the different $\beta$'s are the slopes of the covariate for each one of the groups.

The presentation of the same model in the regression field, and specifically in R, considers an overall intercept, corresponding to one of the groups, and the model matrix could be presented as:

$$X=\begin{bmatrix} \color{red}\vdots & 0 & 0 &\color{red}\vdots & 0 &0 & 0\\ \color{red}{J_{3n,1}} & 1_{n_2} & 0 & \color{red}{x} & 0 & x_{n_2} & 0\\ \color{red}\vdots& 0 & 1_{n_3} & \color{red}\vdots & 0 & 0 & x_{n_3} \end{bmatrix}$$

of the OLS equation:

$$y =\color{red}{\beta_0} + \mu_i +\beta_1\, x_{n_1}+ \beta_2\,x_{n_2} \,+ \beta_3\,x_{n_3}\,+ \epsilon_i$$.

In this model, the overall intercept $\beta_0$ is modified at each group level by $\mu_i$, and the groups also have different slopes.

As you can see from the model matrices, the presentation belies the actual identity between regression and analysis of variance.

I like to kind of verify this with some lines of code and my favorite data set mtcars in R. I am using lm for ANCOVA according to Ben Bolker's paper available here.

mtcars$cyl <- as.factor(mtcars$cyl)         # Cylinders variable into factor w 3 levels
D <- mtcars  # The data set will be called D.
D <- D[order(D$cyl, decreasing = FALSE),]   # Ordering obs. for block matrices.

model.matrix(lm(mpg ~ wt * cyl, D))         # This is the model matrix for ANCOVA

As to the part of the question about what method to use (regression with R!) you may find amusing this on-line commentary I came across while writing this post.

Related Question