When a regression coefficient is nearly 0 (and in fact exactly 0 in the true model), what is the meaning of a p-value (<0.05) for that coefficient?
For example, I ran a multiple regression on simulated data in R with lm(), generating the data from the equation
$$
y=2x_1^2+3x_2^2+3x_1+5
$$
The coefficients of the $x_1x_2$ and $x_2$ terms are zero. I then fit the regression to these data:
xmesh = mesh(seq(-4, 4, 0.1), seq(-4, 4, 0.1))  # mesh() is not base R; expand.grid() is a base-R alternative
x1 = as.vector(xmesh$x)
x2 = as.vector(xmesh$y)
y = 2*x1^2 + 3*x2^2 + 3*x1 + 5
model = lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2))
summary(model)
The result is:
Call:
lm(formula = y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))
Residuals:
Min 1Q Median 3Q Max
-8.871e-12 -4.500e-15 -7.000e-16 5.700e-15 4.194e-12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.000e+00 3.301e-15 1.515e+15 < 2e-16 ***
x1 3.000e+00 7.545e-16 3.976e+15 < 2e-16 ***
x2 -3.348e-15 7.545e-16 -4.438e+00 9.22e-06 ***
I(x1^2) 2.000e+00 3.609e-16 5.542e+15 < 2e-16 ***
I(x2^2) 3.000e+00 3.609e-16 8.314e+15 < 2e-16 ***
I(x1 * x2) -9.377e-16 3.227e-16 -2.906e+00 0.00367 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.429e-13 on 6555 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.313e+31 on 5 and 6555 DF, p-value: < 2.2e-16
We can see that the coefficients of the $x_2$ and $x_1x_2$ terms are nearly 0, yet their p-values are below 0.01. As I understand it, lm() tests each coefficient with a t-test under the null hypothesis $\beta=0$, so a p-value < 0.05 should mean the coefficient is significantly different from 0. But in my model those coefficients should be exactly 0, which confuses me. How should I interpret the significance of these two coefficients?
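(For reference, the p-value in the coefficient table can be reproduced from the estimate and standard error alone. Here is a sketch in Python, with scipy's t distribution standing in for R's pt(); the numbers are taken from the x2 row of the output above.)

```python
from scipy import stats

# Numbers from the x2 row of summary(model) above
estimate = -3.348e-15
std_error = 7.545e-16
df = 6555  # residual degrees of freedom

t_value = estimate / std_error               # about -4.438, matching the table
p_value = 2 * stats.t.sf(abs(t_value), df)   # two-sided p-value, about 9.2e-06
print(t_value, p_value)
```

So the reported p-value is just the two-sided t-test of estimate / standard error; it says nothing about whether the estimate itself is numerically meaningful.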
Adding a second test with $y=2x_1^2+3x_1+0.001x_2+5$:
> y2=2*x1^2+3*(x1)+5+0.001*x2
> model3=lm(y2~x1+x2+I(x1^2)+I(x2^2)+I(x1*x2))
> summary(model3)
Call:
lm(formula = y2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))
Residuals:
Min 1Q Median 3Q Max
-9.237e-12 -1.700e-15 -1.000e-16 2.200e-15 2.757e-12
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.000e+00 2.840e-15 1.761e+15 <2e-16 ***
x1 3.000e+00 6.492e-16 4.621e+15 <2e-16 ***
x2 1.000e-03 6.492e-16 1.540e+12 <2e-16 ***
I(x1^2) 2.000e+00 3.105e-16 6.441e+15 <2e-16 ***
I(x2^2) -2.722e-16 3.105e-16 -8.770e-01 0.381
I(x1 * x2) -3.226e-16 2.776e-16 -1.162e+00 0.245
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.229e-13 on 6555 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.257e+31 on 5 and 6555 DF, p-value: < 2.2e-16
You can see that in this second test the coefficients and standard errors of the $x_1x_2$ and $x_2^2$ terms are essentially zero, and their p-values are large enough that we fail to reject the null hypothesis $\beta=0$. That is a sensible result.
How should I interpret the p-values of the essentially-zero coefficients in these two tests?
Best Answer
This has more to do with how computers work than with p-values. Remember that computers cannot represent real numbers exactly: they work with floating-point numbers, so some computations will never return exactly zero even when the result is zero analytically. For example,

(0.3 - 0.2) - (0.2 - 0.1)

does not evaluate to zero. The same thing is happening in your regression. The estimates of the $x_2$ and $x_1x_2$ coefficients are essentially zero (on the order of $10^{-15}$), and so are their standard errors (on the order of $10^{-16}$). Both numbers are pure floating-point rounding artifacts, so their ratio, the t value, and the p-value computed from it, carry no statistical meaning.
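A minimal demonstration of that expression (shown here in Python; R behaves the same way, since both use IEEE-754 double precision):

```python
# Analytically this is 0.1 - 0.1 = 0, but in double precision it is not
diff = (0.3 - 0.2) - (0.2 - 0.1)
print(diff)       # a tiny nonzero number, about -2.8e-17

# The "zero" coefficients in the regression are artifacts of the same kind:
# nonzero only at the level of accumulated rounding error
print(diff == 0)  # False
```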