Solved – Regression with non-normally distributed residuals

assumptionslinear modelnormal distributionnormality-assumptionregression

There are several posts on this site talking about the need of normality when interpreting the meaning of the p.value of a linear regression. But not much I think is said about how to deal with non-normal data set. On this post, they give some solutions when the distribution is long-tailed.

I'm dealing with a case where my residuals (and my dependent variable) has a multimodal distribution (as it can be seen on the following Kernel density graphs) and takes discrete values (as it can be seen in the other graphs). My model takes "FP" as dependent variable and "Design Complexity" and "Sample size".

On the following graph, each point and whiskers represent 20 points. I strongly expected to have an effect of "Design Complexity" but didn't know wether "Sample size" may have an effect. As I couldn't reject the possibility that an interaction might exist between "Sample size" and "Design Complexity" I observed both.

enter image description here

Here is the R code for ths model.

summary(aov(FP~Obs.size*Design.Complexity, data=data.and.factors))

But my residuals are definitely not normally distributed:

note: All the following graphs are presented in both kernel density and regular x-y plot

plot(density(residuals(aov(data.and.factors$FP~data.and.factors$Design.Complexity*data.and.factors$Obs.size))))

enter image description here
enter image description here

My dependent variable is not normally distributed either and its distribution change from one "Design Complexity" value to another (see below)

Distribution of FP

enter image description here
enter image description here

Distributions of FP for Design Complexity equal 1, 2, 3, 4, 5 and 6.
enter image description here
enter image description here

How can I get trustfull p.values ?

Best Answer

I wouldn't call that multinomial. Residuals are measured on a continuous scale, but they have multimodal distribution, rather than multinomial.

By the way, a kernel density plot makes the modality more difficult to judge. Some kind of strip plot or strip chart would be helpful.

Commenting on your model would be easier with some scientific context, but sample size is less convincing as a control than design complexity. If sample size is an issue at all, would you expect an interaction? Would you expect a linear relationship?

Design complexity has a strong effect. If you start with the much simpler model in which design complexity is the only factor, then what is crucial is the distribution of residuals at each level of complexity.

My bottom line is that normality of residuals is less of a big deal than you seem to think. You seem to have approximate symmetry of residuals, and perhaps your P-values will be a little untrustworthy, but they are usually dubious any way.