Solved – Predict after using Box-Cox Transformation

data-transformation, generalized-linear-model, multiple-regression, predictive-models

I am doing a Multiple Linear Regression on a data set where:
The response variable is continuous.
One of the explanatory variables is continuous and the rest are binary (categorical): 1 if the attribute is present, 0 if it is not.

I fit the multiple linear regression to my data and found that it had non-constant variance, so I applied a Box-Cox transformation.

The Box-Cox transformation seemed to work very well: the residual vs. fitted plot looked good, the residuals followed a normal distribution, and the R-squared and adjusted R-squared values were good.

The data I applied the Box-Cox transformation to was a training set, and I now need to validate the model on a test set. I am using R for my calculations. When I use the predict function in R, the predicted values are on the transformed scale.
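For concreteness, here is roughly what I am doing (a minimal sketch; `train`, `test`, and the variable names `y`, `x1`, `x2`, `x3` are placeholders for my actual data):

```r
library(MASS)  # for boxcox()

# Fit the untransformed model first (response must be positive for Box-Cox)
fit0 <- lm(y ~ x1 + x2 + x3, data = train)

# boxcox() profiles the log-likelihood over a grid of lambda values;
# take the lambda that maximizes it
bc     <- boxcox(fit0, lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Transform the response and refit on the transformed scale
train$y_bc <- if (abs(lambda) < 1e-8) log(train$y) else (train$y^lambda - 1) / lambda
fit_bc     <- lm(y_bc ~ x1 + x2 + x3, data = train)

# predict() returns values on the *transformed* scale
pred_bc <- predict(fit_bc, newdata = test)
```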

I would also like to use the cv.lm function in R (from the DAAG package), which performs cross-validation for a given model and data set; see the sketch below. When using it I am not sure which data set to pass: the original or the transformed. Information on cv.lm can be found at http://www.statmethods.net/stats/regression.html and http://www.inside-r.org/packages/cran/DAAG/docs/CVlm
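Here is roughly how I am calling it, assuming the transformed training frame from the sketch above (note that newer DAAG versions name the function CVlm and take data = instead of the older df =):

```r
library(DAAG)

# Cross-validate on the *transformed* data, since that is the scale
# the model was fit on; m is the number of folds
cv_out <- CVlm(data    = train,
               form.lm = formula(y_bc ~ x1 + x2 + x3),
               m       = 10)
```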

My questions are:

  1. Once I have the predicted values, can I just apply the inverse of the Box-Cox transformation to get my values back on the original scale?

  2. If not, how do I proceed from here to make sense of my model? I have looked in a lot of places online and would really appreciate some insight or expertise on this.

Thanks in advance.

Best Answer

It's common to think of two very different goals when fitting statistical models: inference and prediction. It seems like you might be confusing the two.

The most common use of the Box-Cox transformation is to make the residuals "better behaved"; that is, i.i.d. $\mathrm{N}(0, \sigma^2)$. If the residuals conform to this assumption after the transformation, then the hypothesis tests (namely the F-test and t-tests) that one might like to perform to assess the significance of the estimated regression parameters are valid. To be clear, without the i.i.d. $\mathrm{N}(0, \sigma^2)$ assumption, those hypothesis tests are invalid. This is what I mean by inference.
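For reference, the Box-Cox family of transformations of a positive response $y$ is

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\\[4pt]
\log y, & \lambda = 0,
\end{cases}
$$

with $\lambda$ typically chosen by maximum likelihood so that the transformed residuals come closest to satisfying the assumption above.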

Prediction, on the other hand, does not require such assumptions. You merely fit your model on the training data and predict on the holdout data.

So it really just depends on your goal. If you're only trying to make good predictions, there's no need to fiddle with Box-Cox. But if you're interested in statistical significance, it's worth considering. If your goal is to do both, then there's no reason you can't apply the inverse of the transformation to your predictions.
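As a sketch of that back-transformation (assuming you kept the lambda used for the forward transformation and have predictions `pred_bc` on the transformed scale; these names are placeholders):

```r
# Invert z = (y^lambda - 1) / lambda, i.e. y = (lambda * z + 1)^(1 / lambda);
# for lambda = 0 the forward transform is log(y), so the inverse is exp(z)
inv_boxcox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

pred_original <- inv_boxcox(pred_bc, lambda)
```

One caveat: because the back-transformation is nonlinear, the back-transformed prediction estimates the median of the response on the original scale rather than the mean (the usual retransformation bias), which may or may not matter for your application.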