Solved – Interpreting how much the linear model has improved after Box-Cox transformation

assumptions, data transformation, linear model, regression

I am working on a linear regression project where I first removed insignificant variables and then looked for a suitable transformation of the data. The variable selection went smoothly, but I am having trouble judging whether a transformation makes the data fit the assumptions of a linear model.

After identifying that my dataset requires transformation (some of the 4 assumptions of linear regression were violated for the original dataset), I tried 4 transformations:

  1. log model (log of the response variable)
  2. log-log model (log of the response variable and log of the explanatory variables)
  3. Box-Cox on Y
  4. Box-Cox on X & Y
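
For reference, here is a minimal sketch of how the Box-Cox parameter in transformation 3 (Box-Cox on Y) could be chosen. It assumes a single explanatory variable and a strictly positive response, and it picks the $\lambda$ that maximizes the profile log-likelihood over a grid. The function names (`boxcox`, `ols_rss`, `boxcox_profile_loglik`, `best_lambda`), the grid and the data are illustrative assumptions, not the code used for this question.

```python
import math

def boxcox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        # The limit of (v**lam - 1) / lam as lam -> 0 is log(v).
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def ols_rss(x, y):
    """Residual sum of squares of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

def boxcox_profile_loglik(x, y, lam):
    """Profile log-likelihood of lambda (up to a constant) for y ~ x."""
    n = len(y)
    rss = ols_rss(x, boxcox(y, lam))
    # The second term is the log-Jacobian of the transform; it puts
    # different values of lambda on a comparable scale.
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(v) for v in y)

def best_lambda(x, y, grid):
    """Grid value of lambda with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: boxcox_profile_loglik(x, y, lam))

# Made-up data (an assumption): the response is exponential in x with a
# small deterministic perturbation, so a lambda near 0 (log) is expected.
x = [float(i) for i in range(1, 21)]
y = [math.exp(0.5 + 0.2 * xi + 0.1 * math.sin(3 * xi)) for xi in x]
grid = [k * 0.25 for k in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
lam_hat = best_lambda(x, y, grid)
print("selected lambda:", lam_hat)
```

With $\lambda = 0$ the Box-Cox transform reduces to the log, so the log model (transformation 1) is a special case of Box-Cox on Y.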

I found that Box-Cox on X & Y produced the highest $R^2$, and so selected it as the 'best' transformation.

Upon re-checking the assumptions on the transformed dataset, I found from the partial residual plots that one of the explanatory variables still displayed a non-linear relationship with the residuals.

[Figure: partial residual plot before transformation]

[Figure: partial residual plot after transformation]

The QQ-plot of the residuals, used to check normality, has also changed: it now shows heavier tails rather than skew, and still does not follow a normal distribution closely.

[Figure: QQ-plot before transformation]

[Figure: QQ-plot after transformation]

Finally, the plot of residuals against fitted values, used to check for constant variance, looks worse after the transformation than before:

[Figure: residuals vs fitted before transformation]

[Figure: residuals vs fitted after transformation]

Given these concerns, how should I interpret the effectiveness of this transformation on the data?

Best Answer

  • After applying each of your 4 transformations, you should have checked whether the model assumptions were satisfied. You skipped this step and selected the transformation directly on $R^2$, which is why your partial residual plot still shows a non-linear trend.
  • Also, $R^2$ should never be used to select a transformation. Once the response is transformed, $R^2$ measures the variance explained on a different scale, so its values are not comparable across models with different response transformations.
  • If all the transformations satisfy the assumptions equally well, choose the one that makes the transformed variables easiest to interpret.
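
The workflow in the points above can be sketched as follows: fit each candidate transformation, then compare residual diagnostics rather than $R^2$. This is only a rough illustration, assuming a single explanatory variable and made-up data. The two numeric summaries (`curvature` and `spread_trend`) are crude, hypothetical stand-ins for reading the residuals-vs-fitted plot by eye, not standard tests, and all function names are assumptions.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def correlation(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def diagnostics(x, y):
    """Crude numeric summaries of the residuals-vs-fitted plot."""
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    mf = sum(fitted) / len(fitted)
    return {
        # Far from 0: residuals bend with the fitted values (non-linearity).
        "curvature": correlation(resid, [(f - mf) ** 2 for f in fitted]),
        # Far from 0: residual spread changes with the fitted values.
        "spread_trend": correlation([abs(r) for r in resid], fitted),
    }

# Made-up data (an assumption): the response is exponential in x.
x = [float(i) for i in range(1, 21)]
y = [math.exp(0.5 + 0.2 * xi + 0.1 * math.sin(3 * xi)) for xi in x]

candidates = {
    "raw": (x, y),
    "log Y": (x, [math.log(v) for v in y]),
    "log-log": ([math.log(v) for v in x], [math.log(v) for v in y]),
}
report = {name: diagnostics(cx, cy) for name, (cx, cy) in candidates.items()}
for name, d in report.items():
    print(name, {k: round(v, 2) for k, v in d.items()})
```

In practice you would still look at the partial residual, QQ and residuals-vs-fitted plots for every candidate; the summaries here only make the comparison explicit. Among the candidates that pass, prefer the one that is easiest to interpret.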