Solved – What to do if residual plot looks good but qq-plot doesn’t, after transforming the predictor and response variables

data transformation, multiple regression, normality-assumption, residuals

I'm fitting a multiple regression model to environmental data and am stuck on checking the assumptions. Ultimately, I need to do model selection for the data. There are various explanatory variables and the response variable is expenditure, so this is a predictive model. There are 5 explanatory variables: one is quantitative and the rest are categorical. I have coded the 4 categorical ones with dummy variables. I log-transformed both the quantitative predictor and the response, and the residual plot looks better than the original residual plot; however, the qq plot looks skewed. (It wouldn't make sense to transform the dummy variables, right?)

residual plot (original data)

qq plot (original data)

residual plot (transformed data)

qq plot (transformed data)

My first question is: what should I do? I don't think the assumptions have been met well enough for me to proceed with the analysis yet.

My second question is: how do I deal with the dummy variables? I have coded them as 1 = true, 0 = false, but how exactly do I handle them when determining the final model? For example, one variable is permit type, which has 7 levels, so I need 6 dummy variables. I have already coded them in the Excel file. Another variable is air quality, with 1 being yes and 0 being no. Another is a document that has 3 types, so I have 2 dummy variables, and so on.
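The coding described above (k levels become k - 1 indicators, with the omitted level acting as the baseline) can be sketched in a few lines. This is only an illustration: the permit labels "A" to "G" and the helper name `dummy_code` are hypothetical, not taken from the data in question. A common convention, which the thread itself does not state, is to treat all indicators belonging to one categorical variable as a single group during model selection, keeping or dropping them together.

```python
def dummy_code(values, prefix, baseline=None):
    """Return (names, rows): k-1 indicator columns for a variable with k levels.

    `baseline` is the reference level; its effect is absorbed into the
    intercept, so it gets no column of its own.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    if baseline not in levels:
        raise ValueError(f"baseline {baseline!r} is not an observed level")
    kept = [lv for lv in levels if lv != baseline]
    names = [f"{prefix}_{lv}" for lv in kept]
    # One row per observation; at most one indicator per row equals 1.
    rows = [[1 if v == lv else 0 for lv in kept] for v in values]
    return names, rows


# Hypothetical permit types with 7 levels (labels are made up for illustration).
permit_types = ["A", "B", "C", "D", "E", "F", "G", "A", "C"]
names, rows = dummy_code(permit_types, prefix="permit")

print(names)    # six indicator names; "A" is the baseline
print(rows[0])  # an observation at the baseline level is all zeros
```

Most statistical software builds these columns automatically when a variable is declared categorical, which also makes it easier to handle the indicators as one group.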

Best Answer

I notice your response variable is expenditure. I'm guessing from your plots that none of your Y values are negative, and it seems that you have lots of zeros. That all seems consistent with my conception of expenditure. It is not consistent with using a normal (OLS) linear regression model. You may want to switch to a generalized linear model with a Gamma response. If that doesn't work, or is too complicated, you could use ordinal logistic regression, which is fine as long as you can assume the response is at least ordinal.
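The guess about negatives and zeros can be checked directly on the data rather than read off the plots. The sketch below uses made-up expenditure values and a hypothetical helper, `describe_response`. It is relevant to the Gamma suggestion because a Gamma response is defined only for strictly positive values, so the share of exact zeros matters for that choice.

```python
def describe_response(y):
    """Count negative and zero values in a response variable."""
    n = len(y)
    n_negative = sum(1 for v in y if v < 0)
    n_zero = sum(1 for v in y if v == 0)
    return {
        "n": n,
        "negative": n_negative,
        "zero": n_zero,
        "share_zero": n_zero / n,
        # True only if every value is > 0 (the support of a Gamma response).
        "strictly_positive": n_negative == 0 and n_zero == 0,
    }


# Hypothetical expenditure values, for illustration only.
expenditure = [0.0, 0.0, 120.5, 43.0, 0.0, 980.0, 15.25, 0.0]
summary = describe_response(expenditure)
print(summary)
```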

Regarding the transformations, the fact that you transform one variable does not mean you have to transform any others. (For more on why you might transform variables, see this excellent CV thread: "In linear regression, when is it appropriate to use the log of an independent variable instead of the actual values?") There is no reason to transform the indicators for categorical variables. For a continuous response and explanatory variables, a log transformation may help achieve linearity and normality of residuals, but it is important to bear in mind that it also changes the meaning of your beta estimates. (For more detail on that, see this excellent CV thread: "Interpretation of log transformed predictor".)
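To make the change in meaning concrete, here is a small sketch assuming a model of the form log(y) = b0 + b1*log(x) + b2*d, where d is an untransformed 0/1 indicator. The coefficient values used below are invented for illustration and the function names are hypothetical. Under that assumed model, multiplying x by a factor c multiplies the back-transformed prediction of y by c**b1, and switching d from 0 to 1 multiplies it by exp(b2).

```python
import math


def pct_change_loglog(beta, pct_increase_x):
    """Percent change in back-transformed y when x rises by pct_increase_x percent,
    for a coefficient on log(x) in a model for log(y)."""
    factor = (1 + pct_increase_x / 100) ** beta
    return (factor - 1) * 100


def pct_change_indicator(beta):
    """Percent change in back-transformed y when a 0/1 indicator switches to 1,
    for a coefficient on the indicator in a model for log(y)."""
    return (math.exp(beta) - 1) * 100


# Invented coefficients, for illustration only.
print(round(pct_change_loglog(0.5, 10), 2))   # 4.88: a 10% rise in x gives about +4.88% in y
print(round(pct_change_indicator(0.2), 2))    # 22.14: the indicator shifts y by about +22.14%
```

So a coefficient on a logged predictor reads as an approximate elasticity rather than as a change in y per unit of x, while the indicators keep their 0/1 coding and are interpreted as multiplicative shifts relative to the baseline level.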
