Solved – Normality of dependent variable = normality of residuals

This issue seems to rear its ugly head all the time, and I'm trying to decapitate it for my own understanding of statistics (and sanity!).

The assumptions of general linear models (t-test, ANOVA, regression etc.) include the "assumption of normality", but I have found this is rarely described clearly.

I often come across statistics textbooks / manuals / etc. simply stating that the "assumption of normality" applies to each group (i.e., each level of a categorical X variable), and that we should be examining departures from normality for each group.

Questions:

1. Does the assumption refer to the values of Y or the residuals of Y?
2. For a particular group, is it possible to have a strongly non-normal distribution of Y values (e.g., skewed) BUT an approximately normal (or at least more normal) distribution of residuals of Y?

Other sources describe that the assumption pertains to the residuals of the model (in cases where there are groups, e.g. t-tests / ANOVA), and that we should be examining departures from normality of these residuals (i.e., only one Q-Q plot/test to run).

3. Does normality of residuals for the model imply normality of residuals for the groups? In other words, should we just examine the model residuals (contrary to instructions in many texts)?

To put this in a context, consider this hypothetical example:
- I want to compare tree height (Y) between two populations (X).
- In one population the distribution of Y is strongly right-skewed (i.e., most trees short, very few tall), while the other is virtually normal.
- Height is higher overall in the normally distributed population (suggesting there may be a 'real' difference).
- Transformation of the data does not substantially improve the distribution of the first population.

4. Firstly, is it valid to compare the groups given the radically different height distributions?
5. How do I approach the "assumption of normality" here? Recall height in one population is not normally distributed. Do I examine residuals for both populations separately OR residuals for the model (t-test)?

Please refer to questions by number in replies; experience has shown me that people get lost or sidetracked easily (especially me!). Keep in mind I am not a statistician, though I have a reasonable conceptual (i.e., not technical!) understanding of statistics.

P.S., I have searched the archives and read the following threads which have not cemented my understanding:

Best Answer

One point that may help your understanding:

If $x$ is normally distributed and $a$ and $b$ are constants (with $b \neq 0$), then $y=\frac{x-a}{b}$ is also normally distributed (but with a possibly different mean and variance).
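To make the parenthetical concrete, here is the resulting distribution written out. The symbols $\mu$ and $\sigma^2$ (the mean and variance of $x$) are introduced here only for illustration, and $b \neq 0$ is assumed:

$$x \sim N(\mu, \sigma^2) \quad\Longrightarrow\quad y = \frac{x-a}{b} \sim N\!\left(\frac{\mu - a}{b},\ \frac{\sigma^2}{b^2}\right)$$

Shifting by $a$ moves the mean and dividing by $b$ rescales the spread, but the shape stays normal.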

The residuals are just the y values minus the estimated mean (standardized residuals are also divided by an estimate of the standard error). So, for a given group (i.e., conditional on X), if the y values are normally distributed then the residuals are as well, and the other way around. When we talk about theory or assumptions, it therefore does not matter which of the two we refer to, because each implies the other.
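A minimal simulation sketch of this point, assuming a two-group model in which each fitted value is simply the group mean. The data, group means, sample sizes, and the `excess_kurtosis` helper are all hypothetical choices made for illustration. It shows that subtracting the group mean leaves the within-group shape untouched, whereas y values pooled across groups with different means can have a very different shape from the pooled residuals.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (about 0 for a normal sample)."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return (d**4).mean() / (d**2).mean() ** 2 - 3.0


# Hypothetical data: two normal groups with clearly different means.
y_a = rng.normal(loc=0.0, scale=1.0, size=500)
y_b = rng.normal(loc=10.0, scale=1.0, size=500)

# Residuals of the two-group model: each value minus its own group mean.
resid_a = y_a - y_a.mean()
resid_b = y_b - y_b.mean()

# Pool the raw values and the residuals across both groups.
pooled_y = np.concatenate([y_a, y_b])
pooled_resid = np.concatenate([resid_a, resid_b])

# Within a group, y and its residuals differ only by a shift: same shape.
print("group A, y vs residuals:", excess_kurtosis(y_a), excess_kurtosis(resid_a))
# Pooled y is a two-humped mixture, so its shape statistic is far from 0.
print("pooled y:", excess_kurtosis(pooled_y))
# Pooled residuals are centred on a common mean and look normal again.
print("pooled residuals:", excess_kurtosis(pooled_resid))
```

Excess kurtosis is used here only as a simple, dependency-free shape summary; a Q-Q plot of the same arrays would tell the same story.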

So for the questions this leads to:

1. Both, or either.
2. No (however, the individual y-values will come from normals with different means, which can make them look non-normal if grouped together).
3. Normality of residuals means normality of groups. However, it can be good to examine residuals or y-values by group in some cases (pooling may obscure non-normality that is obvious in one group), or to look at them all together in other cases (when there are not enough observations per group to judge, but all together you can tell).
4. This depends on what you mean by "compare", how big your sample size is, and how comfortable you are with "approximate". The normality assumption is only required for tests/intervals on the results; you can fit the model and describe the point estimates whether there is normality or not. The Central Limit Theorem says that if the sample size is large enough, then the estimates will be approximately normal even if the residuals are not.
5. It depends on what question you are trying to answer and how "approximate" you are happy with.
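The Central Limit Theorem remark can be illustrated with a small simulation. This is only a sketch under assumed settings: the exponential population, the sample size of 50, and the 5000 replicates are arbitrary illustrative choices, and `skewness` is a helper defined here.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable


def skewness(v):
    """Third standardized moment (0 for a symmetric distribution)."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return (d**3).mean() / (d**2).mean() ** 1.5


n_per_sample = 50    # hypothetical number of observations per sample
n_replicates = 5000  # hypothetical number of repeated samples

# Strongly right-skewed population (think: most trees short, few tall).
draws = rng.exponential(scale=2.0, size=(n_replicates, n_per_sample))

# One estimated mean per repeated sample.
sample_means = draws.mean(axis=1)

# The raw values are heavily skewed; the sample means are much less so.
print("skewness of raw values:  ", skewness(draws.ravel()))
print("skewness of sample means:", skewness(sample_means))
```

How large a sample is "large enough" depends on how skewed the data are and on how much approximation you are willing to accept, which is why the answers to questions 4 and 5 are hedged.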

Another point that is important to understand (but is often conflated in learning) is that there are two types of residuals here: the theoretical residuals, which are the differences between the observed values and the true theoretical model, and the observed residuals, which are the differences between the observed values and the estimates from the currently fitted model. We assume that the theoretical residuals are iid normal. The observed residuals are, strictly speaking, neither independent nor identically distributed, so they are not iid normal (but they do have a mean of 0). However, for practical purposes the observed residuals do estimate the theoretical residuals and are therefore still useful for diagnostics.
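One way to see why the observed residuals are not iid is through the hat matrix of a least-squares fit. The sketch below assumes a simple linear regression with an intercept, and the predictor values are hypothetical. Under iid errors with variance sigma^2, the covariance of the observed residuals is sigma^2 times `M = I - H`, so unequal diagonal entries mean unequal variances and non-zero off-diagonal entries mean correlation.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

# Hypothetical predictor values, including one far from the rest.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept

# Hat matrix H maps observed y to fitted values; M maps y to residuals.
H = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(len(x)) - H

# Diagonal of M: relative variance of each observed residual (not constant).
print("relative residual variances:", np.diag(M))
# Off-diagonal entry: relative covariance between residuals 0 and 1.
print("relative covariance (0, 1): ", M[0, 1])

# With an intercept in the model, observed residuals always average to 0.
y = 1.0 + 0.5 * x + rng.normal(size=len(x))  # hypothetical response
residuals = M @ y
print("mean of observed residuals: ", residuals.mean())
```

With many observations and no extreme leverage points these departures are usually small, which is consistent with the observed residuals still being useful for diagnostics.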
