One point that may help your understanding:
If $x$ is normally distributed and $a$ and $b$ are constants, then $y=\frac{x-a}{b}$ is also normally distributed (but with a possibly different mean and variance).
Since the residuals are just the y values minus the estimated mean (standardized residuals are additionally divided by an estimate of the standard error), if the y values are normally distributed then so are the residuals, and vice versa. So when we talk about theory or assumptions it does not matter which we talk about, because each implies the other.
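A quick numerical illustration of the point above (a Python sketch of my own, with made-up numbers, not part of the original answer): the Shapiro-Wilk statistic is unchanged by the shift-and-scale map $y=\frac{x-a}{b}$ (for $b>0$), which is one way to see that standardizing cannot create or destroy normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=200)  # a normal sample (illustrative numbers)

a, b = 5.0, 2.0                               # any constants, b > 0
y = (x - a) / b                               # standardized version of x

# The Shapiro-Wilk statistic is invariant under this location-scale shift,
# so x "looks normal" to the test exactly when y does.
w_x = stats.shapiro(x).statistic
w_y = stats.shapiro(y).statistic
print(np.isclose(w_x, w_y))
```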
So for the questions this leads to:
- yes, both, either
- No (however, the individual y-values will come from normals with different means, which can make them look non-normal if grouped together)
- Normality of residuals means normality of groups, however it can be good to examine residuals or y-values by groups in some cases (pooling may obscure non-normality that is obvious in a group) or looking all together in other cases (not enough observations per group to determine, but all together you can tell).
- This depends on what you mean by compare, how big your sample size is, and your feelings on "Approximate". The normality assumption is only required for tests/intervals on the results, you can fit the model and describe the point estimates whether there is normality or not. The Central Limit Theorem says that if the sample size is large enough then the estimates will be approximately normal even if the residuals are not.
- It depends on what question you are trying to answer and what degree of "approximate" you are happy with.
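The Central Limit Theorem point above can be shown with a short simulation (a Python sketch with assumed parameters, not from the original answer): even with strongly skewed errors, the sampling distribution of the fitted slope is close to symmetric and centred on the true value once n is moderately large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 100, 2000
x = np.linspace(0, 1, n)

slopes = []
for _ in range(reps):
    eps = rng.exponential(scale=1.0, size=n) - 1.0  # skewed, mean-zero errors
    y = 1.0 + 2.0 * x + eps                          # true slope is 2
    slope, _ = np.polyfit(x, y, 1)
    slopes.append(slope)

slopes = np.asarray(slopes)
# The slope estimates centre on 2 and show essentially no skew,
# even though the residual distribution is far from normal.
print(abs(slopes.mean() - 2.0) < 0.1, abs(stats.skew(slopes)) < 0.2)
```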
Another point that is important to understand (but is often conflated in learning) is that there are 2 types of residuals here: the theoretical residuals, which are the differences between the observed values and the true theoretical model, and the observed residuals, which are the differences between the observed values and the estimates from the currently fitted model. We assume that the theoretical residuals are iid normal. The observed residuals are not independent, not identically distributed, and not exactly normal (but they do have a mean of 0). However, for practical purposes the observed residuals do estimate the theoretical residuals and are therefore still useful for diagnostics.
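To see concretely why the observed residuals are neither independent nor identically distributed, here is a small sketch (Python, illustrative numbers of my own). With $e = (I - H)y$ and hat matrix $H = X(X'X)^{-1}X'$, the observed residuals have covariance $\sigma^2(I - H)$: the diagonal entries differ and the off-diagonals are nonzero, yet the residuals still sum to zero when the model has an intercept.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])  # intercept + one predictor
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
resid = (np.eye(n) - H) @ y               # observed residuals

cov = np.eye(n) - H                       # their covariance, up to sigma^2
print(np.isclose(resid.sum(), 0.0))       # forced to sum to zero (intercept)
print(np.ptp(np.diag(cov)) > 1e-6)        # unequal variances: not identically distributed
print(abs(cov[0, 1]) > 1e-6)              # nonzero covariance: not independent
```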
Note that this has nothing at all to do with residuals as such. It applies generally to looking at any distributions.
The two graphs do not have exactly the same purpose. Be clear that a symmetry plot checks only for symmetry or asymmetry and would look unremarkable (points close to the reference line) for many symmetric distributions that are not Gaussian, e.g. t distributions with finite degrees of freedom. But there is still a question of whether the graphs contradict each other.
I here assume familiarity with normal probability plots (historically the more common name, although some prefer Gaussian quantile-quantile plots). See for example this explanation.
However, symmetry plots seem less used and bear some explanation.
Stata's symplot, as the axis titles imply, pairs values above and below the median and plots (largest $-$ median) vs (median $-$ smallest), (second largest $-$ median) vs (median $-$ second smallest), etc.; the reference line is thus (value in upper half $-$ median) $=$ (median $-$ value in lower half), implying symmetry of the distribution.
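The pairing can be re-created in a few lines (my own Python sketch of the recipe just described, not Stata's actual implementation):

```python
import numpy as np

def symplot_coords(values):
    """Return (median - lower half, upper half - median) distance pairs."""
    v = np.sort(np.asarray(values, dtype=float))
    med = np.median(v)
    k = v.size // 2                  # with odd n, the median itself is dropped
    dist_below = med - v[:k]         # median - smallest, median - 2nd smallest, ...
    dist_above = v[::-1][:k] - med   # largest - median, 2nd largest - median, ...
    return dist_below, dist_above

# A perfectly symmetric sample: the two halves match exactly,
# so every point would fall on the reference line.
below, above = symplot_coords([1, 2, 3, 4, 5])
print(below, above)
```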
What you can't tell easily from symplot in cases like this is how many values are in the middle, often approximately symmetric, part of the distribution and how many are in the rest.
It is therefore easy for symplot to impart a pessimistic message, because points may be heavily overplotted near the middle of the distribution.
Here is another example. I simulate 95% of values from a Gaussian and 5% of values from a gamma with the same variance (but evidently different skew).
This is the Stata recipe used:
clear
set obs 10000
set seed 2803
gen y = cond(_n <= 9500, rnormal(6,10), rgamma(1,10))
symplot y
qnorm y
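The same mixture can be sketched in Python (an assumed translation of the Stata recipe, with NumPy's generators standing in for rnormal and rgamma; the random streams will of course differ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2803)                   # echoing `set seed 2803`
n = 10_000
normal_part = rng.normal(6, 10, size=n)             # counterpart of rnormal(6,10)
gamma_part = rng.gamma(shape=1, scale=10, size=n)   # counterpart of rgamma(1,10); variance 1 * 10^2 = 100
y = np.where(np.arange(1, n + 1) <= 9_500, normal_part, gamma_part)

# The 5% gamma component drags the sample skewness positive,
# which is what both plots are reacting to.
print(stats.skew(y))
```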
Loosely, the symplot seems to flag lack of symmetry (and thus lack of normality) more prominently than the normal probability plot (Gaussian quantile-quantile plot) flags lack of Gaussianity.
It's manifestly the same data, but the tail is inevitably more prominent in one graph than another. In addition to the question of overplotting, in a symmetry plot all the bad news is usually lumped together at one end; in a normal probability plot there is often bad news in both tails.
Best Answer
Not knowing which methods you used to test for the normality of the residuals and the dependent variable, respectively, it's difficult for me to give you an exact answer. However, I assume that you used a visual comparison or some kind of significance test to check for normality.
Since you mentioned that you only have 20 datapoints per group, I think that the problem lies with the sample size, at least if you used a standard "off-the-shelf" frequentist test to assess the normality of each group. For example, if you use the Shapiro-Wilk test to check for normality, you are essentially comparing your (standardised and ordered) sample with an ordered sample drawn from a standard normal distribution. If your sample deviates too much from the standard normal distribution, the difference is deemed "significant" (for example at the 0.05 level), giving you a hint that your sample should not be regarded as normally distributed.
But the Shapiro-Wilk test, like most normality tests, is highly sensitive to the sample size. If your sample size is too low, it is very hard to detect a difference from a normally distributed sample, so most of the test results will be non-significant. If you increase the sample size, however, even small deviations from the normal distribution will turn out to be "significant".
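A quick demonstration of this sample-size effect (a Python sketch with illustrative numbers, not from the original answer): the same mildly skewed population typically passes Shapiro-Wilk at $n = 20$ and fails it decisively at $n = 2000$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Gamma(shape=8) is only mildly skewed (skewness about 0.7).
small = rng.gamma(shape=8.0, scale=1.0, size=20)
large = rng.gamma(shape=8.0, scale=1.0, size=2000)

# Same population, very different verdicts:
print(stats.shapiro(small).pvalue)   # typically non-significant at n = 20
print(stats.shapiro(large).pvalue)   # typically tiny at n = 2000
```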
This is probably what happened in your case. When you were using the residuals to test for normality, you had a total of $4 \times 20 = 80$ data points for each variable (height, weight & waist circumference), with the result that two out of three tests turned out to be significant. When you were using the data points within each group, you conducted more tests in total ($4 \times 3 = 12$ tests, one for each variable in each group), but due to the low sample size in each test only 1 out of 12 tests turned out to be significant (waist circumference in group 3).
I hope that helped to clear things up a bit. To give you any meaningful recommendation on which methods to use, I would need more information on your data set and the exact tests you used.