Solved – How to test for normality in a 2×2 ANOVA

Tags: anova, assumptions, normality-assumption, residuals, spss

Study Design: I showed participants some information about sea-level rise, focusing the information in different ways, both in terms of the time-scale and the magnitude of potential rise. Thus I had a 2 (Time: 2050 or 2100) by 2 (Magnitude: Medium or High) design. There were also two control groups who received no information, only answering the questions for my DVs.

Questions:
I've always checked for normality within cells; for the 2×2 portion of this design, that would mean checking normality within 4 groups. However, reading some discussions here has made me second-guess my methods.

First, I've read that I should be looking at the normality of the residuals. How can I check for normality of residuals (in SPSS or elsewhere)? Do I have to do this for each of the 4 groups (6 including the controls)?

I also read that normality within groups implies normality of the residuals. Is this true? (Literature references?) Again, does this mean looking at each of the 4 cells separately?

In short, what steps would you take to determine whether your (2×2) data are not violating assumptions of normality?

References are always appreciated, even if just to point me in the right direction.

Best Answer

Most statistics packages have ways of saving residuals from your model. Using GLM - UNIVARIATE in SPSS you can save residuals. This will add a variable to your data file representing the residual for each observation.

Once you have your residuals you can then examine them to see whether they are normally distributed, homoscedastic, and so on. For example, you could use a formal normality test on your residual variable or perhaps more appropriately, you could plot the residuals to check for any major departures from normality. If you want to examine homoscedasticity, you could get a plot that looked at the residuals by group.
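If you want to look at the residuals outside SPSS, here is a minimal sketch of the numbers behind a normal Q-Q plot, using only the Python standard library. The residual values are made up for illustration, and the function names `qq_points` and `qq_correlation` are hypothetical. It is not a formal test: a correlation near 1 between the theoretical and observed quantiles just means the points lie close to a straight line.

```python
from statistics import NormalDist, mean

def qq_points(residuals):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    observed = sorted(residuals)
    # Plotting positions (i - 0.5) / n keep probabilities strictly inside (0, 1).
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

def qq_correlation(residuals):
    """Pearson correlation between theoretical and observed quantiles."""
    pairs = qq_points(residuals)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical residuals saved from the model, for illustration only.
residuals = [-1.8, -0.9, -0.4, -0.1, 0.0, 0.2, 0.5, 0.7, 1.1, 1.6]
r = qq_correlation(residuals)
```

Plotting the pairs from `qq_points` (theoretical on the x-axis, observed on the y-axis) gives the usual Q-Q plot; marked curvature or a few points far off the line are the departures to look for.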

For a basic between-subjects factorial ANOVA where homogeneity of variance holds, normality within cells means normality of residuals, because the ANOVA model predicts the cell means. Thus, each residual is just the difference between an observed value and the mean of its cell.
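To make that concrete, here is a small sketch in Python of how residuals come out of a full factorial model (both main effects plus the interaction), where the prediction for each observation is its cell mean. The scores and the helper name `cell_residuals` are hypothetical.

```python
from statistics import mean

# Hypothetical scores keyed by (time, magnitude) cell.
data = {
    ("2050", "Medium"): [4.0, 5.0, 6.0],
    ("2050", "High"): [6.0, 7.0, 8.0, 7.0],
    ("2100", "Medium"): [3.0, 5.0, 4.0],
    ("2100", "High"): [8.0, 9.0, 7.0],
}

def cell_residuals(cells):
    """Residual = observed score minus the mean of its own cell."""
    out = {}
    for cell, scores in cells.items():
        m = mean(scores)
        out[cell] = [y - m for y in scores]
    return out

residuals = cell_residuals(data)

# Pool the residuals from all cells into one variable, as SPSS does
# when it saves residuals to the data file.
pooled = [e for cell in residuals.values() for e in cell]
```

The `pooled` list is the single residual variable you would then plot or test; within each cell the residuals sum to zero by construction.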

Response to comments below:

Residuals are defined relative to your model predictions. In this case your model predictions are your cell means. Focusing on plots of the residuals, rather than on the distribution within each individual cell, is a more generalisable way of thinking about assumption testing, even if in this particular case the two are basically the same. For example, if you add a covariate (ANCOVA), residuals would be more appropriate to examine than distributions within cells.
For purposes of examining normality, standardised and unstandardised residuals will provide the same answer. Standardised residuals can be useful when you are trying to identify data that are poorly modelled by the model (i.e., outliers).
Homogeneity of variance and homoscedasticity mean the same thing as far as I'm aware. Once again, it is common to examine this assumption by comparing the variances across groups/cells. In your case, whether you calculate variance in residuals for each cell or based on the raw data in each cell, you will get the same values. However, you can also plot residuals on the y-axis and predicted values on the x-axis. This is a more generalisable approach as it is also applicable to other situations such as where you add covariates or you are doing multiple regression.
A point was raised below that when you have heteroscedasticity (i.e., within-cell variance differs between cells in the population) and normally distributed residuals within cells, the pooled distribution of all residuals would be non-normal. The result would be a mixture of normal distributions with a mean of zero and different variances, in proportions determined by the cell sizes. The resulting distribution will have zero skew, but would presumably have some excess kurtosis. If you divide the residuals by their corresponding within-cell standard deviation, you remove the effect of heteroscedasticity; plotting the resulting scaled residuals provides an overall check of whether the residuals are normally distributed, independent of any heteroscedasticity.
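Two of the points above can be checked in a short Python sketch: that per-cell variances are the same for raw scores and for residuals, and that dividing residuals by their within-cell standard deviation puts cells with different spread on a common scale. The cell names and scores are hypothetical, chosen so that one cell is ten times as spread out as the other.

```python
from statistics import mean, stdev, variance

# Hypothetical cells with deliberately unequal spread.
cells = {
    "low_variance": [9.0, 10.0, 11.0, 10.0],
    "high_variance": [0.0, 10.0, 20.0, 10.0],
}

def residuals_for(scores):
    """Residuals from the cell mean (the model prediction for that cell)."""
    m = mean(scores)
    return [y - m for y in scores]

# Subtracting a constant leaves the variance unchanged, so raw scores and
# residuals give the same per-cell variance.
raw_var = {name: variance(s) for name, s in cells.items()}
res_var = {name: variance(residuals_for(s)) for name, s in cells.items()}

# (predicted value, residual) pairs for a residuals-versus-predicted plot.
plot_points = [(mean(s), e) for s in cells.values() for e in residuals_for(s)]

# Divide each residual by its own cell's sample SD (n - 1 denominator).
# This raises ZeroDivisionError if a cell has no variation at all.
scaled = [e / stdev(s) for s in cells.values() for e in residuals_for(s)]
```

In this toy data the two cells have very different raw variances, but their scaled residuals are identical, which is the sense in which scaling removes the heteroscedasticity before you look at normality.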

Related Solutions

Solved – How to analyze a 2×3 design in which one level doesn’t differ on one of the factors

One strategy would be to see your design as containing 5 groups, which we could label:

C, M2050, M2100, E2050, E2100

You could then set up various planned contrasts that examine questions of interest.

Here are some example contrast weights for testing various research questions:

C    M2050   M2100  E2050  E2100  Comparison
+4   -1      -1     -1     -1     Control versus other
0    +1      +1     -1     -1     Moderate versus extreme
0    +1      -1     +1     -1     2050 versus 2100
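As a rough illustration of what those weights do, here is a Python sketch that applies them to a set of group means. The means are invented, and the helper names `contrast_estimate` and `dot` are hypothetical; the weights are the ones in the table. It only computes the contrast estimates and checks the weights, not the significance tests.

```python
# Contrast weights from the table, in the order C, M2050, M2100, E2050, E2100.
weights = {
    "Control versus other": [4, -1, -1, -1, -1],
    "Moderate versus extreme": [0, 1, 1, -1, -1],
    "2050 versus 2100": [0, 1, -1, 1, -1],
}

# Hypothetical group means, same order as the weights.
group_means = [3.0, 4.0, 5.0, 6.0, 7.0]

def contrast_estimate(w, means):
    """Weighted sum of group means for one contrast."""
    return sum(wi * mi for wi, mi in zip(w, means))

def dot(u, v):
    """Dot product of two weight vectors (zero means orthogonal with equal n)."""
    return sum(a * b for a, b in zip(u, v))

estimates = {name: contrast_estimate(w, group_means) for name, w in weights.items()}

# Each set of weights should sum to zero to be a valid contrast.
sums = {name: sum(w) for name, w in weights.items()}

# Pairwise dot products between the three contrasts.
names = list(weights)
dots = [dot(weights[a], weights[b])
        for i, a in enumerate(names) for b in names[i + 1:]]
```

All three sets of weights sum to zero, and every pairwise dot product is zero, so with equal cell sizes the three contrasts are orthogonal.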

You could achieve something similar by first testing whether control is different from the average of the other four groups and then performing the $2\times2$ ANOVA omitting the control group. The default tests in the $2\times2$ are likely to correspond to many of the planned comparisons you would do anyway (i.e., M versus E, 2050 versus 2100, and the interaction between amount and duration). However, the contrast approach might be slightly more powerful because your error variance may be smaller (it is based on deviations from the four non-control group means rather than one overall non-control mean).

In previous questions, you've asked about SPSS, so here's an example of testing a contrast using GLM in SPSS. And here's a lecture on how to do it using R.

Solved – Normality of dependent variable = normality of residuals

One point that may help your understanding:

If $x$ is normally distributed and $a$ and $b$ are constants with $b \neq 0$, then $y=\frac{x-a}{b}$ is also normally distributed (but with a possibly different mean and variance).
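Spelled out, assuming $x \sim N(\mu, \sigma^2)$ and $b \neq 0$ (the symbols $\mu$ and $\sigma^2$ are introduced here only for illustration), the mean and variance of $y$ follow from the usual rules for linear transformations:

$$y = \frac{x-a}{b} \sim N\!\left(\frac{\mu-a}{b},\ \frac{\sigma^2}{b^2}\right)$$

Subtracting $a$ shifts the mean and dividing by $b$ rescales the spread, but the shape stays normal.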

Since the residuals are just the y values minus the estimated mean (standardized residuals are also divided by an estimate of the standard error), if the y values are normally distributed then the residuals are as well, and vice versa. So when we talk about theory or assumptions, it does not matter which one we talk about, because one implies the other.

So for the questions this leads to:

1. Yes, both, either.
2. No (however, the individual y-values will come from normals with different means, which can make them look non-normal if grouped together).
3. Normality of residuals means normality of groups. However, it can be good to examine residuals or y-values by group in some cases (pooling may obscure non-normality that is obvious in a group) and all together in other cases (there may not be enough observations per group to tell, but all together you can).
4. This depends on what you mean by compare, how big your sample size is, and your feelings on "approximate". The normality assumption is only required for tests/intervals on the results; you can fit the model and describe the point estimates whether there is normality or not. The Central Limit Theorem says that if the sample size is large enough then the estimates will be approximately normal even if the residuals are not.
5. It depends on what question you are trying to answer and how "approximate" you are happy with.

Another point that is important to understand (but is often conflated in learning) is that there are 2 types of residuals here: the theoretical residuals, which are the differences between the observed values and the true theoretical model, and the observed residuals, which are the differences between the observed values and the estimates from the currently fitted model. We assume that the theoretical residuals are iid normal. The observed residuals are not strictly independent or identically distributed (they are correlated with one another and can have unequal variances), though they do have a mean of 0. However, for practical purposes the observed residuals do estimate the theoretical residuals and are therefore still useful for diagnostics.
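A simple case shows why the observed residuals cannot be independent. Assume a single group of $n$ observations $y_1, \dots, y_n$ that are independent with common variance $\sigma^2$, and let $e_i = y_i - \bar{y}$ be the observed residuals (this notation is introduced here for illustration). Then:

$$\sum_{i=1}^{n} e_i = 0, \qquad \operatorname{Var}(e_i) = \sigma^2\left(1 - \frac{1}{n}\right), \qquad \operatorname{Cov}(e_i, e_j) = -\frac{\sigma^2}{n} \quad (i \neq j)$$

The residuals are forced to sum to zero, so each one is slightly negatively correlated with the others. The effect shrinks as $n$ grows, which is why the observed residuals remain a reasonable stand-in for the theoretical ones in diagnostics.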
