I want to run an ANOVA test, so I am testing for normality. I have tested each group and the residuals (grouped together) for normality. My data sample does not look approximately normal. However, I have an outlier (5.95 SD from the mean). It is a true value, not a data-entry error. When I delete this value, the sample looks close to a normal distribution. How should I deal with this value? Is it best to use a non-parametric test? A transformation? Can I just remove the value?
Solved – Should I remove the outlier
anova, nonparametric, normal-distribution, outliers, standard-deviation
Related Solutions
One point that may help your understanding:
If $x$ is normally distributed and $a$ and $b$ are constants, then $y=\frac{x-a}{b}$ is also normally distributed (but with a possibly different mean and variance).
Since the residuals are just the $y$ values minus the estimated mean (standardized residuals are additionally divided by an estimate of the standard error), if the $y$ values are normally distributed then the residuals are as well, and vice versa. So when we talk about theory or assumptions it does not matter which we refer to, because each implies the other.
So for the questions this leads to:
- yes, both, either
- No (however, the individual $y$-values will come from normal distributions with different means, which can make them look non-normal when grouped together)
- Normality of the residuals implies normality within the groups. Still, it can be useful to examine residuals or $y$-values by group in some cases (pooling may obscure non-normality that is obvious within a single group) or all together in other cases (there may be too few observations per group to judge, but pooled together you can tell).
- This depends on what you mean by "compare", how big your sample size is, and how approximate you are willing to accept. The normality assumption is only required for tests/intervals on the results; you can fit the model and describe the point estimates whether there is normality or not. The Central Limit Theorem says that if the sample size is large enough, the estimates will be approximately normal even if the residuals are not.
- It depends on what question you are trying to answer and what degree of "approximate" you are happy with.
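The point about pooled $y$-values looking non-normal even when each group is normal can be illustrated with a quick simulation (a Python sketch with made-up groups; the seed and group means are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# three groups with the same spread but very different means
groups = [rng.normal(loc=m, scale=1.0, size=200) for m in (0.0, 5.0, 10.0)]

pooled = np.concatenate(groups)                             # raw y-values, lumped together
residuals = np.concatenate([g - g.mean() for g in groups])  # y minus each group's mean

# the pooled y-values are trimodal and fail a normality test,
# while the residuals (which remove the group means) look normal
p_pooled = stats.shapiro(pooled).pvalue
p_resid = stats.shapiro(residuals).pvalue
```

Subtracting each group's mean is exactly what taking residuals from a one-way ANOVA does, which is why checking residuals is the right move when group means differ.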
Another point that is important to understand (but is often conflated when learning) is that there are two types of residuals here: the theoretical residuals, which are the differences between the observed values and the true theoretical model, and the observed residuals, which are the differences between the observed values and the estimates from the currently fitted model. We assume that the theoretical residuals are i.i.d. normal. The observed residuals are not independent, not identically distributed, and not exactly normal (though they do have a mean of 0). However, for practical purposes the observed residuals do estimate the theoretical residuals and are therefore still useful for diagnostics.
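One concrete way to see the difference (a Python sketch with simulated data): observed residuals from any least-squares fit that includes an intercept sum to exactly zero by construction, a constraint the theoretical residuals do not obey.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
theoretical_resid = rng.normal(size=n)                # the i.i.d. normal errors we assume
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # design with an intercept
y = X @ np.array([1.0, 2.0]) + theoretical_resid       # arbitrary true coefficients

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
observed_resid = y - X @ beta_hat

# the observed residuals sum to (numerically) exactly zero because the
# model has an intercept; the theoretical residuals sum to some nonzero value
```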
If your sample size is large, a normality test will flag even a distribution that is similar (but not identical) to the normal as significantly non-normal. On the other hand, what causes problems in ANOVA is the size of the departure from normality, not the statistical significance of that departure. So we need to measure that departure.
The usual approach is to check skewness and kurtosis. If skewness is small and kurtosis is not very different from that of the normal distribution, we can treat the distribution as nearly normal for most practical purposes. Furthermore, ANOVA is quite robust to the normality assumption, and results are not expected to change much due to a small departure from it (the same can be said about the assumption of equal variances). To assess how big the departure from normality is, a rule of thumb given by the Statgraphics in-program help (sorry, I can't find any other reference) is the interval (-2, +2) for standardized skewness and kurtosis.
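A minimal Python sketch of that check, assuming the usual large-sample standard errors $\sqrt{6/n}$ for skewness and $\sqrt{24/n}$ for excess kurtosis (which appears to be what such "standardized" values divide by; the (-2, +2) threshold is the rule of thumb quoted above, not a formal criterion):

```python
import numpy as np
from scipy import stats

def standardized_shape(x):
    """Skewness and excess kurtosis divided by their large-sample standard
    errors, sqrt(6/n) and sqrt(24/n); values inside roughly (-2, +2)
    suggest the shape is close to normal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z_skew = stats.skew(x) / np.sqrt(6.0 / n)
    z_kurt = stats.kurtosis(x) / np.sqrt(24.0 / n)  # excess kurtosis (normal = 0)
    return z_skew, z_kurt

rng = np.random.default_rng(42)
z_s, z_k = standardized_shape(rng.normal(size=1000))  # normal data: both near 0
```

For genuinely normal data both values behave like standard normal variates, so falling outside (-2, +2) roughly corresponds to a two-sided test at the 5% level.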
Anyway, if the distribution is actually far from normal, you can use a non-parametric test such as Kruskal-Wallis.
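For example, with SciPy the Kruskal-Wallis test is a one-liner (a sketch with made-up groups; the third group is deliberately shifted so the test has something to find):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=50)
b = rng.normal(0.0, 1.0, size=50)
c = rng.normal(1.5, 1.0, size=50)  # shifted group

# Kruskal-Wallis compares the groups through ranks, so it needs no
# normality assumption (only similarly shaped distributions per group)
stat, p = stats.kruskal(a, b, c)
```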
Update about equal variances
About the assumption of equal variances, the same can be said: it doesn't matter that we can be sure the variances are not exactly equal; what matters is how different they are. From your graphics I would say the variation of your residuals doesn't look very different across groups, so you aren't far from homoscedasticity. If you compute the variance for each group, a rule of thumb is that ANOVA results are still valid as long as the biggest variance is no more than ten times the smallest one (again, no references; I just heard it from a more experienced professor).
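That variance-ratio check is trivial to compute (a Python sketch; the cutoff of 10 is the professor's heuristic from the text, not a formal test):

```python
import numpy as np

def max_variance_ratio(*groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [np.var(g, ddof=1) for g in groups]
    return max(variances) / min(variances)

# heuristic from the text: treat ANOVA as reasonably safe while this
# ratio stays below about 10
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # twice the spread => four times the variance
ratio = max_variance_ratio(g1, g2)  # -> 4.0
```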
Update about statistical significance vs practical significance
Your distributions are nearly normal, with a small (maybe tiny) departure from normality. If your sample were small, no test could detect such a small departure, but with a large sample, tests can detect that your distributions are not exactly normal. That little difference is real (hence the small p-value), but it is too small to matter for practical purposes such as performing ANOVA.
I suggest reading about statistical significance vs practical significance; searching for those terms will turn up plenty of good discussions.
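The sample-size effect is easy to reproduce (a Python sketch: a $t$ distribution with 20 degrees of freedom is only slightly heavier-tailed than the normal, so it stands in for a "tiny departure"; the sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
small = rng.standard_t(df=20, size=200)
large = rng.standard_t(df=20, size=200_000)

# D'Agostino's normality test (based on skewness and kurtosis):
# the same tiny departure goes unnoticed at n = 200 but becomes
# "highly significant" at n = 200,000
p_small = stats.normaltest(small).pvalue
p_large = stats.normaltest(large).pvalue
```

The underlying distribution is identical in both cases; only the test's power to detect the (practically irrelevant) departure changes with $n$.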
Best Answer
You could consider Cook's distance as an aid for your decision. It is a measure of the effect that removing an observation would have on your analysis. Values with a large Cook's distance merit further attention; those with a small Cook's distance, despite being far outside the range of your other observations, shouldn't do much harm. As you do not say which statistical software you use, I cannot tell you exactly how to do that. I use R myself; there I would look at the Cook's distance diagnostic graph of the `plot` method for an `lm` object:

```r
plot(lm.fit, which = 4)  # where lm.fit is an lm-object; which = 4 plots Cook's distance
```

My apologies, this suggestion would probably have been better suited for a comment, but I lack the reputation to comment.
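If you are not using R, the quantity behind that plot is easy to compute directly. Below is a Python/NumPy sketch using the standard formula $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_{ii}}{(1-h_{ii})^2}$; the design matrix and data are hypothetical:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance D_i = (e_i^2 / (p * s^2)) * h_ii / (1 - h_ii)^2
    for an OLS fit of y on the design matrix X (n x p, full column rank)."""
    n, p = X.shape
    H = X @ np.linalg.pinv(X)       # hat matrix X (X'X)^-1 X'
    h = np.diag(H)                  # leverages
    e = y - H @ y                   # observed residuals
    s2 = e @ e / (n - p)            # residual variance estimate
    return e**2 / (p * s2) * h / (1.0 - h) ** 2

# hypothetical one-way layout: intercept plus a dummy for the second group
X = np.column_stack([np.ones(8), np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)])
y = np.array([1.0, 1.1, 0.9, 1.0, 2.0, 2.1, 1.9, 9.0])  # last value is an outlier
D = cooks_distance(X, y)
# the outlier's Cook's distance dwarfs the others, flagging it for attention
```

An observation can sit far from the mean yet have a small $D_i$ if it barely moves the fitted values, which is exactly why Cook's distance is a better removal criterion than raw distance from the mean.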