Solved – Non-normality of residuals in linear regression of very large sample in SPSS

Tags: large data, multiple regression, normality-assumption, residuals, spss

I have a dataset of ~17,000 cases in SPSS 21 with which I am trying to run a multiple linear regression. I have plotted the Studentised residuals against the unstandardised predicted values, and also against each predictor included in the model; these plots indicate a fair degree of skewness and a few possible outliers. The outcome variable itself is highly skewed, so I also fitted a model based on a Log10 transformation of this data, which produced less non-normal residuals (though now slightly skewed in the opposite direction).

The data have been extensively cleaned and I'm quite confident that any outliers are not errors.

I'm unsure how to proceed. I read somewhere that the normality of residuals is actually of minor importance so long as the other assumptions (validity and linearity) are met. Especially considering the large sample size, do I need to worry too much about the non-normality of the residuals? And if I were to use the model based on the transformed data, how would I properly interpret the output?

Thank you

Best Answer

The skewness of the outcome variable (considered unconditionally, i.e. ignoring the independent variables) will depend on the arrangement of the independent variables -- it might validly be anything. You shouldn't be trying to make the marginal distribution of the outcome look like any particular thing; it's the error term that the normality assumption is needed for.
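To illustrate, here is a quick simulation sketch (in Python rather than SPSS, and nothing to do with your data): with perfectly normal errors but a skewed predictor, the marginal distribution of the outcome is itself skewed even though every regression assumption holds.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 17_000

    # A strongly right-skewed predictor and exactly normal errors
    x = rng.exponential(scale=2.0, size=n)
    e = rng.normal(scale=1.0, size=n)

    # The regression model is exactly right: linear mean, normal, constant-variance errors
    y = 1.0 + 3.0 * x + e

    print(stats.skew(y))            # clearly positive: the marginal outcome is skewed
    print(stats.skew(y - 3.0 * x))  # approximately 0: the errors are not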

Normality of residuals probably isn't all that important compared to the other assumptions (unless you're after prediction intervals) -- you will want to focus more on getting the models for the mean and variance right.

That said, if a log transform produces slightly left-skewed residuals, you might possibly do better with a Gamma GLM (the log of a gamma random variable is left-skewed, with the degree of skewness depending on the gamma's shape parameter). Aside from that, a Gamma model with a log link has a lot of similarities to a linear model in the logs. It also has the advantage of readily handling other nonlinear relationships between the conditional mean of the outcome and the linear predictor (the linear combination of the independent variables), simply by choosing a different link function.

(That's if such a GLM is suitable at all -- and again, the model for the mean and variance matters more than the distributional assumption. A Gamma GLM implies heteroskedasticity, with the spread of the outcome growing with its mean; if there's no evidence of that in your data, you may not be better off than with ordinary linear regression.)
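If you want to see roughly what such a fit looks like, here is a minimal sketch of a Gamma GLM with a log link in Python's statsmodels, using simulated stand-in data and placeholder variable names rather than your SPSS dataset (I believe SPSS offers the same model through its generalized linear models procedure):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Simulated stand-in data: a positive, right-skewed outcome whose mean depends on two predictors
    rng = np.random.default_rng(0)
    n = 1_000
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    mu = np.exp(0.5 + 0.8 * x1 - 0.3 * x2)    # conditional mean on the original scale
    y = rng.gamma(shape=2.0, scale=mu / 2.0)  # Gamma outcome with E[y | x] = mu

    X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

    # Gamma GLM with a log link: models log E[y | x] as linear in the predictors,
    # and implies the standard deviation of y is proportional to its mean
    fit = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
    print(fit.summary())

The coefficients come out on the scale of log E[y | x], so you read them multiplicatively, much like the interpretation of the log-transformed linear model discussed below.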

And if I were to use the model based on the transformed data, how would I properly interpret the output?

If you assume approximate normality of the logs, it implies that your linear, additive-error model on the log-scale is a multiplicative lognormal model on the original scale.
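In symbols: on the log scale the fitted model is

$$\log_{10}(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon\,,$$

and back-transforming gives a multiplicative model on the original scale,

$$y = 10^{\beta_0}\,(10^{\beta_1})^{x_1}\cdots(10^{\beta_p})^{x_p}\cdot 10^{\varepsilon}\,.$$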

I find it easier to interpret natural logs rather than base 10 logs (not least, I have a lot more practice at it), but since one is simply a scaled version of the other, most of the intuition carries across.

On the log scale, a one-unit change in one of your independent variables, $x_j$, produces an additive change of $\beta_j$ (the corresponding coefficient) in the outcome. On the original scale, that same one-unit change multiplies the typical outcome (e.g. the mean, or the median - the multiplicative effect on either is the same) by $10^{\beta_j}$.
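For example (with a made-up coefficient, purely to illustrate the arithmetic): if $\beta_j = 0.05$, a one-unit increase in $x_j$ multiplies the typical outcome by $10^{0.05} \approx 1.12$, roughly a 12% increase; if $\beta_j = -0.05$, it multiplies it by $10^{-0.05} \approx 0.89$, roughly an 11% decrease.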

Beware: if you want to make statements about the (conditional) mean of the outcome (rather than changes in it, as discussed in the previous paragraph), you don't just take $10^{\text{mean on the log scale}}$. If you need to do this I can provide more details about the calculation under the normal assumption. (This is not an issue for the GLM approach, since it models the mean directly rather than via a transform.)
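(For reference, the standard lognormal result that calculation rests on: if $\log_{10}(y)$ given the predictors is normal with mean $\mu$ and error variance $\sigma^2$, then

$$E[y \mid x] = 10^{\,\mu + \frac{\ln(10)}{2}\sigma^2}\,,$$

while the naive back-transform $10^{\mu}$ estimates the conditional median rather than the mean.)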

However, prediction intervals, for example, transform back just fine.
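A minimal sketch of that back-transformation, again in statsmodels rather than SPSS and with placeholder names (the interval is computed on the log10 scale and its endpoints are then simply exponentiated):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Placeholder data: positive outcome y, predictors x1 and x2 (not your dataset)
    rng = np.random.default_rng(0)
    n = 500
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 10 ** (0.5 + 0.3 * x1 - 0.2 * x2 + rng.normal(scale=0.2, size=n))

    X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))
    fit = sm.OLS(np.log10(y), X).fit()

    # 95% prediction interval for a new observation, on the log10 scale...
    new = pd.DataFrame({"const": [1.0], "x1": [0.5], "x2": [-1.0]})
    pred = fit.get_prediction(new).summary_frame(alpha=0.05)

    # ...and back-transformed to the original scale by exponentiating the endpoints
    lo, hi = 10 ** pred["obs_ci_lower"], 10 ** pred["obs_ci_upper"]
    print(float(lo.iloc[0]), float(hi.iloc[0]))

Note that the back-transformed interval is no longer symmetric about the point prediction, which is just what you'd expect for a right-skewed outcome.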