Solved – Assumptions of linear models and what to do if the residuals are not normally distributed

assumptions, linear-model, normality-assumption, residuals

I am a little confused about what the assumptions of linear regression are.

So far I have checked whether:

  • all of the explanatory variables correlate linearly with the response variable (this was the case);
  • there is any collinearity among the explanatory variables (there was little collinearity);
  • the Cook's distances of the data points in my model are below 1 (this is the case; all distances are below 0.4, so there are no influential points);
  • the residuals are normally distributed (this may not be the case).

But I then read the following:

violations of normality often arise either because (a) the
distributions of the dependent and/or independent variables are
themselves significantly non-normal, and/or (b) the linearity
assumption is violated.

Question 1
This makes it sound as if the dependent and independent variables need to be normally distributed, but as far as I know this is not the case. My dependent variable, as well as one of my independent variables, is not normally distributed. Should they be?

Question 2
My normal Q-Q plot of the residuals looks like this:

[figure: normality check of the residuals]

It differs slightly from a normal distribution, and shapiro.test also rejects the null hypothesis that the residuals come from a normal distribution:

> shapiro.test(residuals(lmresult))
W = 0.9171, p-value = 3.618e-06

The residuals vs. fitted values plot looks like this:

[figure: residuals vs. fitted values]

What can I do if my residuals are not normally distributed? Does it mean the linear model is entirely useless?

Best Answer

First off, I would get yourself a copy of this classic and approachable article and read it: Anscombe FJ. (1973) Graphs in statistical analysis. The American Statistician 27:17–21.

On to your questions:

Answer 1: Neither the dependent nor the independent variables need to be normally distributed. In fact, they can have all kinds of loopy distributions. The normality assumption applies to the distribution of the errors ($Y_{i} - \widehat{Y}_{i}$).
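To make this concrete, here is a small simulated sketch (the data and variable names are illustrative, not taken from your model): the predictor and the response are both strongly skewed, yet the errors are normal, so the normality assumption is satisfied.

```r
## Sketch with simulated data: a skewed predictor and response,
## but normally distributed errors, so OLS normality holds.
set.seed(1)
x <- rexp(500)                # strongly right-skewed predictor
y <- 2 + 3 * x + rnorm(500)   # errors are normal; y inherits the skew of x
fit <- lm(y ~ x)

shapiro.test(y)               # rejects: y itself is far from normal
shapiro.test(residuals(fit))  # typically does not reject normality here
```

The point is that the second test looks at exactly the quantity the assumption is about, while the first does not.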

Answer 2: You are actually asking about two separate assumptions of ordinary least squares (OLS) regression:

  1. One is the assumption of linearity. This means that the trend in $\overline{Y}$ across $X$ is expressed by a straight line. (Right? Straight back to algebra: $y = a + bx$, where $a$ is the $y$-intercept and $b$ is the slope of the line.) A violation of this assumption simply means that the relationship is not well described by a straight line (e.g., $\overline{Y}$ is a sinusoidal function of $X$, or a quadratic function, or even a straight line that changes slope at some point). My own preferred two-step approach to addressing non-linearity is to (1) perform some kind of non-parametric smoothing regression to suggest specific nonlinear functional relationships between $Y$ and $X$ (e.g., using LOWESS, or GAMs, etc.), and (2) specify a functional relationship, using either a multiple regression that includes nonlinearities in $X$ (e.g., $Y \sim X + X^{2}$), or a nonlinear least squares regression model that includes nonlinearities in the parameters of $X$ (e.g., $Y \sim X + \max{(X-\theta,0)}$, where $\theta$ represents the point where the regression line of $\overline{Y}$ on $X$ changes slope).

  2. Another is the assumption of normally distributed residuals. Sometimes one can validly get away with non-normal residuals in an OLS context; see for example, Lumley T, Emerson S. (2002) The Importance of the Normality Assumption in Large Public Health Data Sets. Annual Review of Public Health. 23:151–69. Sometimes, one cannot (again, see the Anscombe article).

However, I would recommend thinking about the assumptions in OLS not so much as desired properties of your data, but rather as interesting points of departure for describing nature. After all, most of what we care about in the world is more interesting than $y$-intercept and slope. Creatively violating OLS assumptions (with the appropriate methods) allows us to ask and answer more interesting questions.