I am trying to fit a quadratic to my data, which consists of (x, y) tuples.
The choices are:
1) lm(y ~ x + I(x^2))
2) lm(y ~ I(x - mean(x)) + I((x - mean(x))^2))
3) lm(y ~ I(x - mean(x)) + I(x^2 - mean(x^2)))
In other words, in 3) I am centering the quadratic term using its own mean.
I understand that centering to reduce multicollinearity is not the issue here; I am just trying to understand how to center in general. Intuitively, 3) makes the most sense to me: I am treating the linear and quadratic variables as separate and centering each in the usual way. 2) is odd because the quadratic term also contains a linear component once you expand the square. 1) and 3) give the same coefficients, which differ from 2)'s, but there seems to be no obvious relationship between the linear coefficients from 2) and 1). The quadratic coefficient is the same across all three models.
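For concreteness, the three parameterizations can be sketched as follows. This is a minimal sketch in Python, with plain least squares standing in for R's lm() and simulated data standing in for the original (x, y) pairs, which aren't shown; the coefficient pattern described above still appears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the original (x, y) pairs, which aren't shown.
x = rng.uniform(50.0, 200.0, 300)
y = 250.0 - 0.4 * (x - x.mean()) + rng.normal(0.0, 20.0, 300)

def fit(*cols):
    """OLS of y on an intercept plus the given regressor columns."""
    X = np.column_stack([np.ones_like(x), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, linear, quadratic]

xc = x - x.mean()
b1 = fit(x, x**2)                   # 1) raw x and x^2
b2 = fit(xc, xc**2)                 # 2) both terms built from centered x
b3 = fit(xc, x**2 - (x**2).mean())  # 3) each regressor centered by its own mean

# Quadratic coefficients agree across all three models;
# models 1 and 3 also share the same linear coefficient.
assert np.allclose(b1[2], b2[2]) and np.allclose(b1[2], b3[2])
assert np.allclose(b1[1], b3[1])
```

All three parameterizations span the same column space {1, x, x²}, so they produce identical fitted values and residuals; only the labeling of the coefficients changes.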
The outputs are
model 1)
Call:
lm(formula = y ~ x + I(x^2))
Residuals:
Min 1Q Median 3Q Max
-73.845 -10.151 1.224 9.660 73.553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 262.709845 82.982956 3.166 0.0016 **
x 0.150473 1.346574 0.112 0.9111
I(x^2) -0.002182 0.005459 -0.400 0.6895
model 2)
Call:
lm(formula = y ~ I(x - mean(x)) + I((x - mean(x))^2))
Residuals:
Min 1Q Median 3Q Max
-73.845 -10.151 1.224 9.660 73.553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 247.263060 0.657972 375.796 <2e-16 ***
I(x - mean(x)) -0.396789 0.080544 -4.926 1e-06 ***
I((x - mean(x))^2) -0.002182 0.005459 -0.400 0.69
And model 3)
Call:
lm(formula = y ~ I(x - mean(x)) + I(x^2 - mean(x^2)))
Residuals:
Min 1Q Median 3Q Max
-73.845 -10.151 1.224 9.660 73.553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 247.138199 0.579052 426.798 <2e-16 ***
I(x - mean(x)) 0.150473 1.346574 0.112 0.911
I(x^2 - mean(x^2)) -0.002182 0.005459 -0.400 0.690
Notice that 1) and 3) give the same coefficient estimates, while 2) differs in the coefficient on the linear term. The coefficients of the quadratic term all agree. The linear term is significant in model 2 but not in the other two. Why?
Best Answer
When you fit a regression model with a single variable and its square, the interpretation of the coefficient on the linear term changes: it is the instantaneous slope of the parabola at the point where the regressor equals zero. From this it is easy to see how models 1 and 2 differ. Model 1 gives you the slope of the tangent line to the parabola at $x = 0$, whereas Model 2 gives you the tangent at the mean of $x$. Model 3, on the other hand, looks more complicated, but a little algebra shows what you are actually fitting (note that it centers the squared term by $\overline{x^2}$, the mean of $x^2$, not by $\bar{x}^2$):
$$ \begin{eqnarray} y &=& a + b(x-\bar{x}) + c\,(x^2 - \overline{x^2})\\ &=& a + b(x-\bar{x}) + c\left[(x-\bar{x})^2 + 2\bar{x}(x-\bar{x}) + \bar{x}^2 - \overline{x^2}\right]\\ &=& \left[a + c\,(\bar{x}^2 - \overline{x^2})\right] + (b + 2c\bar{x})(x-\bar{x}) + c\,(x-\bar{x})^2 \end{eqnarray} $$
The constant $\bar{x}^2 - \overline{x^2}$ is absorbed into the intercept, so Model 3's linear coefficient $b$ is exactly Model 1's raw linear coefficient, the slope at $x = 0$; that is why Models 1 and 3 agree. Model 2's linear coefficient is $b + 2c\bar{x}$, the slope at $\bar{x}$, which is why it differs and can be significant even when the slope at zero is not.
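This identity can also be checked numerically. Below is a minimal sketch in Python (plain least squares standing in for lm(), simulated data since the original isn't available) confirming that the centered-x model's linear coefficient equals the raw linear coefficient plus $2c\bar{x}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data; the original (x, y) pairs aren't available.
x = rng.uniform(50.0, 200.0, 500)
y = 1.0 + 0.5 * x - 0.01 * x**2 + rng.normal(0.0, 5.0, 500)

def fit(*cols):
    """OLS of y on an intercept plus the given regressor columns."""
    X = np.column_stack([np.ones_like(x), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

xc = x - x.mean()
_, b_raw, c = fit(x, x**2)    # model 1 slopes (model 3 yields the same slopes)
_, b_cen, _ = fit(xc, xc**2)  # model 2: slope of the tangent at mean(x)

# Slope at the mean = slope at zero + 2 * c * mean(x).
assert np.allclose(b_cen, b_raw + 2.0 * c * x.mean())
```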