Solved – Centering in linear regression

Tags: centering, regression

I am trying to fit a quadratic model to my data; I have tuples (x, y).

The choices are:

1) lm(y~x+I(x^2))

2) lm(y~(x-mean(x))+I(x-mean(x))^2)

3) lm(y~(x-mean(x))+I(x^2 - mean(x^2)))

In other words, in 3) I am centering the quadratic term using its own mean.

I understand that centering to reduce multicollinearity is not the issue here; I am just trying to understand how centering should be done in general. Intuitively 3) makes more sense to me: I am treating the linear and the quadratic variables as separate and centering each in the usual way. 2) seems odd because the squared centered term also contains a linear component once you expand the square. 1) and 3) give the same coefficients, which differ from 2)'s, yet there seems to be no obvious relationship between the linear coefficients of 1) and 2). The quadratic coefficient is the same across all three models.
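For reference, a minimal sketch of the three fits on simulated data (the values below are made up, not the data behind the outputs that follow). Note that inside an R formula the centering arithmetic has to be wrapped in I() to be taken literally:

set.seed(1)
x <- runif(200, 50, 200)                                # made-up predictor
y <- 250 - 0.4*(x - mean(x)) - 0.002*(x - mean(x))^2 + rnorm(200, sd = 15)

m1 <- lm(y ~ x + I(x^2))                                # 1) raw terms
m2 <- lm(y ~ I(x - mean(x)) + I((x - mean(x))^2))       # 2) center x, then square
m3 <- lm(y ~ I(x - mean(x)) + I(x^2 - mean(x^2)))       # 3) center each term by its own mean

cbind(m1 = coef(m1), m2 = coef(m2), m3 = coef(m3))      # compare the estimates side by side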

The outputs are:

model 1)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
    Min      1Q  Median      3Q     Max 
-73.845 -10.151   1.224   9.660  73.553 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 262.709845  82.982956   3.166   0.0016 **
x             0.150473   1.346574   0.112   0.9111   
I(x^2)       -0.002182   0.005459  -0.400   0.6895   

model 2)

Call:
lm(formula = y ~ (x-mean(x)) + (x-mean(x))^2)

Residuals:
    Min      1Q  Median      3Q     Max 
-73.845 -10.151   1.224   9.660  73.553 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    247.263060   0.657972 375.796   <2e-16 ***
x -mean(x)      -0.396789   0.080544  -4.926    1e-06 ***
(x -mean(x))^2  -0.002182   0.005459  -0.400     0.69    

And model 3)

Call:
lm(formula = y ~ (x - mean(x)) + I(x^2 - mean(x^2)))

Residuals:
    Min      1Q  Median      3Q     Max 
-73.845 -10.151   1.224   9.660  73.553 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        247.138199   0.579052 426.798   <2e-16 ***
x - mean(x)         0.150473   1.346574   0.112    0.911    
I(x^2 - mean(x^2))  -0.002182   0.005459  -0.400    0.690    

Notice that 1) and 3) give the same coefficient estimates, while 2) differs in the coefficient on the linear term. The coefficients on the quadratic term all agree. The linear term is significant in model 2 but not in the other two models. Why?

Best Answer

When you fit a regression model with a single variable and its square, the interpretation of the coefficient on the linear term changes: it is the instantaneous slope of the parabola at the point where the predictor in that parametrization equals zero. From this it is easy to see how models 1 and 2 differ: model 1 gives you the tangent line to the parabola at $x = 0$, whereas model 2 gives you the tangent line at $x = \bar{x}$. Model 3, on the other hand, looks more opaque, but a little algebra shows what you are actually fitting:

$$ \begin{eqnarray} y &=& a + b\,(x-\bar{x}) + c\,(x^2 - \overline{x^2})\\ &=& a + b\,(x-\bar{x}) + c\left[(x-\bar{x})^2 + 2\bar{x}(x-\bar{x}) + \bar{x}^2 - \overline{x^2}\right]\\ &=& \left[a + c\,(\bar{x}^2 - \overline{x^2})\right] + (b + 2c\bar{x})\,(x-\bar{x}) + c\,(x-\bar{x})^2 \end{eqnarray} $$

where $\overline{x^2}$ denotes the sample mean of $x^2$. So the quadratic coefficient $c$ is the same in every parametrization; model 3 has the same linear coefficient $b$ as model 1, because subtracting constants only shifts the intercept; and the linear coefficient in model 2's centered parametrization is $b + 2c\bar{x}$, the slope of the parabola at $\bar{x}$. That is also why the linear term can be significant in model 2 but not in the others: they are testing different quantities, the slope at $\bar{x}$ versus the slope at $x = 0$.
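A quick numerical check of this (a sketch, assuming x and y are whatever data were fit above, and that the centered terms are wrapped in I() so the formula takes them literally):

m1 <- lm(y ~ x + I(x^2))                                # raw parametrization: coefficients a0, b, c
m2 <- lm(y ~ I(x - mean(x)) + I((x - mean(x))^2))       # fully centered parametrization

# slope of the model-1 parabola at x = mean(x): b + 2*c*mean(x)
unname(coef(m1)[2] + 2 * coef(m1)[3] * mean(x))
# linear coefficient of the centered fit -- should be identical
unname(coef(m2)[2])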
