There are two points here:
The passage recommends transforming IVs to linearity only when there is evidence of nonlinearity. Nonlinear relationships among IVs can also cause collinearity and, more centrally, may complicate other relationships. I am not sure I agree with the advice in the book, but it's not silly.
Certainly very strong linear relationships can be causes of collinearity, but high correlations are neither necessary nor sufficient to cause problematic collinearity. A good method of diagnosing collinearity is the condition index.
EDIT in response to comment
Condition indexes are described briefly here as "square root of the maximum eigenvalue divided by the minimum eigenvalue". There are quite a few posts here on CV that discuss them and their merits. The seminal texts on them are two books by David Belsley: Conditioning Diagnostics and Regression Diagnostics (which has a new edition, 2005, as well).
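As a sketch of the computation (the near-collinear design here is made up for illustration; scaling the columns to unit length before taking singular values follows Belsley's recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)

# Design matrix with intercept; scale each column to unit length
# before computing singular values.
X = np.column_stack([np.ones(n), x1, x2, x3])
Xs = X / np.linalg.norm(X, axis=0)

# Condition indexes: largest singular value divided by each singular
# value (equivalently, sqrt of the ratio of eigenvalues of Xs'Xs).
s = np.linalg.svd(Xs, compute_uv=False)
condition_indexes = s[0] / s
print(condition_indexes)   # a large index (rule of thumb: > 30) flags the x1/x2 dependency
```

Note the diagnosis doesn't rely on any pairwise correlation being extreme; it picks up near-linear dependencies involving any number of columns.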
The skewness of the outcome variable (treated unconditionally on the other variables) will depend on the arrangement of the independent variables -- it might validly be anything. You shouldn't be trying to make the distribution of the outcome look like any particular thing. It's the error term the normal assumption is needed for.
Normality of residuals probably isn't all that important compared to the other assumptions (unless you're after prediction intervals) -- you will want to focus more on getting the models for the mean and variance right.
That said, if a log-transform produces slightly left skew residuals, you might possibly do better with a Gamma GLM (the log of a gamma random variable is left skew, the degree of skewness depends on the gamma's shape parameter). Aside from that, the Gamma model with a log link has a lot of similarities to a linear model in the logs. This also has the advantage of readily dealing with other nonlinear relationships between the conditional mean of the outcome and the linear predictor (linear combination of the independent variables) by choice of a different link function.
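The claim about the skewness of a logged gamma variate can be checked with a quick simulation (the shape values below are arbitrary, chosen just to show the trend):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_skewness(z):
    """Moment-based skewness estimate."""
    z = z - z.mean()
    return (z ** 3).mean() / (z ** 2).mean() ** 1.5

# The log of a Gamma variate is left skew, and the skewness
# weakens as the shape parameter grows.
skews = {shape: sample_skewness(np.log(rng.gamma(shape, size=100_000)))
         for shape in (0.5, 2.0, 10.0)}
print(skews)   # all negative; the magnitude shrinks as shape grows
```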
(That is, if such a GLM is suitable; and again, the model for the mean and variance matters more than the distributional assumption. A Gamma GLM implies heteroskedasticity, specifically a constant coefficient of variation; if there's no evidence of this in your data, you may not be better off than with linear regression.)
And if I were to use the model based on the transformed data, how would I properly interpret the output?
If you assume approximate normality of the logs, it implies that your linear, additive-error model on the log-scale is a multiplicative lognormal model on the original scale.
I find it easier to interpret natural logs rather than base 10 logs (not least, I have a lot more practice at it), but since one is simply a scaled version of the other, most of the intuition carries across.
On the log scale, a unit change in one of your independent variables, $x_j$, produces an additive change of the corresponding coefficient, $\beta_j$, in the outcome. On the original scale, a unit change in the independent variable multiplies the typical outcome (e.g. the mean, or the median; the effect on either is the same) by $10^{\beta_j}$.
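For instance, with base-10 logs and a hypothetical coefficient of 0.3:

```python
# Hypothetical coefficient from regressing log10(y) on the IVs:
beta_j = 0.3

# Additive on the log scale, multiplicative on the original scale:
multiplier = 10 ** beta_j
print(multiplier)   # ~1.995: a unit increase in x_j roughly doubles
                    # the typical (median) outcome
```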
Beware: if you want to make statements about the (conditional) mean of the outcome (rather than changes in it, as discussed in the previous paragraph), you don't just take $10^\text{mean on the log scale}$. If you need to do this I can provide more details about the calculation under the normal assumption. (This is not an issue for the GLM approach, since it models the mean directly rather than via a transform)
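A simulation sketch of that warning, assuming base-10 logs and made-up values for the mean and sd on the log scale:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.4   # mean and sd of log10(Y); assumed values

y = 10 ** rng.normal(mu, sigma, size=1_000_000)

naive = 10 ** mu   # back-transforming the log-scale mean gives the MEDIAN
# The lognormal mean needs a variance correction (sigma converted to natural logs):
corrected = 10 ** mu * np.exp((sigma * np.log(10)) ** 2 / 2)

print(np.median(y), naive)     # these agree
print(y.mean(), corrected)     # these agree; the naive value is too small
```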
However, prediction intervals, for example, transform back just fine.
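A minimal sketch of that back-transform (the predicted value and standard error are made up; a real application would take them from the fitted model):

```python
# Hypothetical prediction on the log10 scale for a new observation:
pred_log = 1.2    # predicted log10(y)
se_pred = 0.25    # standard error of prediction (residual + estimation variance)

z = 1.96          # approximate 97.5% standard normal quantile
lo_log, hi_log = pred_log - z * se_pred, pred_log + z * se_pred

# Exponentiation is monotone, so coverage is preserved: just
# transform the endpoints back to the original scale.
lo_y, hi_y = 10 ** lo_log, 10 ** hi_log
print(lo_y, hi_y)   # an asymmetric interval around 10**pred_log
```

Note the resulting interval is asymmetric on the original scale, which is exactly what you want for a right-skewed outcome.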
Best Answer
Okay, a few things.
1) I always advise against using tests for normality. They answer a question you already know the answer to, i.e. "Is your data normal?" (The answer is no because nothing is normal) vs the question "Is the lack of normality going to be a problem?" which is the question you should be interested in.
2) The assumption of normality is not so much about the predictive performance, but rather the correctness of the inference you would perform (hypothesis tests and confidence intervals).
3) Some deviation from normality is okay, because we have asymptotics that drive test statistics to normality.
4) Your QQ-plot does not appear severely non-normal (although there might be some bimodality in your residuals; you may want to check whether there is an omitted variable or something similar). As another commenter stated, normality is the assumption that can tolerate some failure (mild to moderate deviations from it).
5) So to answer your question
(i) Yes, you do the log transform (or some other transformation) first.
(ii) Once you transform your variable, the nonnormality should be much less of a problem. EDIT: it may be worth looking into why the residuals seem to fall into two distinct clusters.
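The point in 1) above can be illustrated with a quick simulation (using scipy's `normaltest`; the t-distribution and sample sizes are chosen just for the demo):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# A t-distribution with 30 df is visually indistinguishable from a normal,
# yet with enough data a normality test rejects it decisively.
small = rng.standard_t(df=30, size=50)
large = rng.standard_t(df=30, size=1_000_000)

p_small = stats.normaltest(small).pvalue
p_large = stats.normaltest(large).pvalue
print(p_small, p_large)   # the large sample is "significantly" non-normal
```

The test answers "is the data exactly normal?" (it never is), not "does the non-normality matter here?", and its power to detect irrelevant deviations grows with sample size.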