Solved – Box-Cox transforms for regression

data-transformation, regression, variance

I'm trying to fit a linear model to some data with a single predictor (pairs (x, y)). For small values of x the y values fit a straight line tightly, but as x increases the y values become more volatile. Here is an example of such data (R code):

    y = c(3.2, 3.4, 3.5, 3.8, 4.2, 5.5, 4.5, 6.8, 7.4, 5.9)
    x = seq(1,10,1)

I'm curious to know whether there is a power transform (Box-Cox, perhaps?) that would give a better fit to the data than the simple linear fit below.

    fit = lm(y ~ x)
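One quick way to see the problem the question describes is to fit the line and look at the residuals; a fan shape, with the spread growing along with the fitted values, is the classic sign of non-constant variance. A minimal sketch using the data above:

```r
y <- c(3.2, 3.4, 3.5, 3.8, 4.2, 5.5, 4.5, 6.8, 7.4, 5.9)
x <- seq(1, 10, 1)

fit <- lm(y ~ x)

# Residuals vs. fitted values: a widening spread as the fitted
# values grow suggests the error variance is not constant.
plot(fitted(fit), resid(fit))
```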

Best Answer

The MASS package, which ships with every R installation, has a boxcox() function you can use. After reading in the data, do:

    library(MASS)
    boxcox(y ~ x)

Then look at the plot this produces, which shows a 95% confidence interval for the Box-Cox transformation parameter. But you do not really have enough data (n = 10) for this to be informative: the resulting confidence interval runs almost from -2 to 2, with a maximum-likelihood estimate of approximately 0 (a log transform, as others have said). If your real data have more observations, you should try this.
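The estimated parameter can also be read off programmatically: boxcox() returns its lambda grid and the profile log-likelihood, so you can pick out the maximizing lambda and refit on the transformed response. A sketch, with the usual convention that lambda = 0 means a log transform:

```r
library(MASS)

y <- c(3.2, 3.4, 3.5, 3.8, 4.2, 5.5, 4.5, 6.8, 7.4, 5.9)
x <- seq(1, 10, 1)

# boxcox() returns the lambda grid ($x) and profile log-likelihood ($y);
# the maximum-likelihood estimate is the grid point with the highest value.
bc <- boxcox(y ~ x, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Apply the Box-Cox transform at lambda_hat and refit;
# lambda = 0 corresponds to taking logs.
y_t <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat
fit_t <- lm(y_t ~ x)
```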

As others have said, this transformation is really trying to stabilize the variance. That is not obvious from the theory: what boxcox() actually does is maximize a likelihood based on a normal distribution with constant variance. One might think that maximizing a normal-based likelihood would mainly normalize the distribution of the residuals, but in practice the dominant contribution to the likelihood comes from stabilizing the variance. This is perhaps not so surprising, given that the likelihood being maximized is based on a constant-variance normal family!
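The variance-stabilizing behaviour is easy to see in a small simulation: with multiplicative (lognormal) noise, the standard deviation of the response grows with its mean, and boxcox() lands near lambda = 0, i.e. the log transform that makes the noise homoscedastic. A sketch (the simulated model below is my own illustration, not from the answer):

```r
library(MASS)
set.seed(1)

# Simulated data with multiplicative noise: the standard deviation
# of y grows with its mean, so the log is variance-stabilizing.
x <- 1:200
y <- exp(0.5 + 0.02 * x) * exp(rnorm(200, sd = 0.2))

bc <- boxcox(y ~ x, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat  # typically close to 0, i.e. a log transform
```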

I once wrote a slider-based demo in XLispStat that demonstrated this clearly!
