Solved – Multiple Linear Regression – Residual Normality and Transformations

Tags: data transformation, multiple regression, regression

I have a multiple linear regression with about 20 significant predictors – some categorical and some continuous. I ran the model with statsmodels in Python.

I get a high adjusted R^2 of approximately 0.95, which suggests a good fit. A predicted vs. actual plot (shown below) shows good linearity.

[predicted vs. actual plot]

However, I'm having problems when I check the assumptions: my residuals do not appear to be normally distributed.

My residuals vs predicted values plot looks like this:

[residuals vs. predicted values plot]

Looking at this, and depending on the scale, I conclude that the residuals might be randomly distributed around a mean of zero, that there is "minimal" heteroscedasticity, and that there are some outliers.

However, if I plot a residuals histogram I get this:

[histogram of residuals]

This suggests that the residuals may be distributed symmetrically around their mean, but not normally.

If I make a Q-Q plot of the residuals I get this:

[Q-Q plot of residuals]

This I understand to indicate a fat-tailed distribution.
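For what it's worth, the fat-tail pattern can be checked numerically as well as visually. A minimal sketch, on synthetic data (not my actual dataset), using sample kurtosis and the Jarque-Bera test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic regression whose errors are t-distributed (fat-tailed),
# mimicking a curvy Q-Q plot despite a good linear fit.
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta = np.array([1.0, 2.0, -1.5, 0.5])
y = X @ beta + stats.t.rvs(df=3, size=n, random_state=0)

# Ordinary least squares via least squares (lstsq)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Excess kurtosis > 0 means tails heavier than the normal's;
# the Jarque-Bera test formalises the same comparison.
kurt = stats.kurtosis(resid)            # Fisher definition: normal -> 0
jb_stat, jb_p = stats.jarque_bera(resid)
print(kurt, jb_p)
```

A small Jarque-Bera p-value with positive excess kurtosis matches the fat-tailed Q-Q shape described above.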

So my questions are:

  1. The linearity suggests the model is strong, but the residual plots suggest it is unstable. How do I reconcile these? Is this a good model or an unstable one?

  2. If the model is unstable, how can I transform the variables (independent, dependent, or both) to make my residuals normally distributed while maintaining strong linearity? I've tried various transformations (log, ln, Box-Cox, etc.) on the dependent variable, on all independent variables, and on some independent variables, and all they do is destroy the linearity without fixing the residual distribution.

Am I missing something obvious?

Thanks in advance for help and suggestions.

Best Answer

I have run into this kind of situation many times myself. Here are a few comments from my experience. It is rare to see a Q-Q plot that lines up along a straight line.

  1. The linearity suggests the model is strong, but the residual plots suggest it is unstable. How do I reconcile these? Is this a good model or an unstable one?

Response: The curvy Q-Q plot does not invalidate your model. However, 20 variables seems like a lot. Were they chosen via variable selection such as AIC, BIC, or the lasso? Have you tried cross-validation to guard against overfitting? Even after all this, your Q-Q plot may still look curvy. You can experiment with interaction terms and polynomial terms in your regression, but a Q-Q plot that does not line up nicely along a straight line is not a substantial issue in practical terms.
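A quick way to run the overfitting check mentioned above, without extra dependencies, is K-fold cross-validation by hand. A minimal numpy sketch on synthetic data (all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.ones(p + 1)
y = X @ beta + rng.normal(scale=2.0, size=n)

def kfold_r2(X, y, k=5):
    """Average out-of-fold R^2 across k folds of an OLS fit."""
    idx = np.arange(len(y))
    np.random.default_rng(42).shuffle(idx)
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[fold] @ coef
        ss_res = np.sum((y[fold] - pred) ** 2)
        ss_tot = np.sum((y[fold] - y[fold].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return float(np.mean(scores))

cv_r2 = kfold_r2(X, y)
print(cv_r2)
```

If the out-of-fold R^2 stays close to the in-sample adjusted R^2 (0.95 in your case), overfitting is less of a worry; a large drop would be a red flag.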

Say you are comfortable with retaining all 20 predictors. You can, at a minimum, report heteroskedasticity-robust (White) or heteroskedasticity-and-autocorrelation-consistent (Newey-West) standard errors. Your residual plots indicate a few clear outliers; you can drop those observations and your Q-Q plot will look less curvy.

  2. If the model is unstable, how can I transform the variables (independent, dependent, or both) to make my residuals normally distributed while maintaining strong linearity? I've tried various transformations (log, ln, Box-Cox, etc.) on the dependent variable, on all independent variables, and on some independent variables, and all they do is destroy the linearity without fixing the residual distribution.

Response: The transformations you tried are all reasonable things to try, but you need not be fixated on fixing the residual plot. Even if the Q-Q plot does not line up on a straight line, your estimated OLS coefficients are still unbiased and consistent. What is affected is the standard errors of those coefficients, and you can apply common fixes such as White or Newey-West standard errors, or bootstrapping, to get a conservative estimate of the standard errors so that you do not conclude a coefficient is significant when it is not.
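Bootstrapping, the last option mentioned, needs no distributional assumption at all. A minimal pairs-bootstrap sketch in numpy (synthetic data, helper name hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
# Fat-tailed (t-distributed) errors, like the residuals in the question
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=3, size=n)

def pairs_bootstrap_se(X, y, n_boot=500, seed=0):
    """SE of OLS coefficients from resampling (x, y) pairs with replacement."""
    rng2 = np.random.default_rng(seed)
    coefs = []
    for _ in range(n_boot):
        idx = rng2.integers(0, len(y), size=len(y))
        c, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        coefs.append(c)
    return np.std(coefs, axis=0, ddof=1)

se = pairs_bootstrap_se(X, y)
print(se)
```

Resampling whole (x, y) pairs, rather than residuals, keeps the procedure valid under heteroskedasticity as well as non-normality.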
