I am considering the following conceptual question from Introduction to Statistical Learning, chapter 3, number 4.
I collect a set of data ($n$ = 100 observations) containing a single
predictor and a quantitative response. I then fit a linear regression
model to the data, as well as a separate cubic regression, i.e. $Y =
\beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon$.

(a) Suppose that the true relationship between $X$ and $Y$ is linear,
i.e. $Y = \beta_0 + \beta_1X + \epsilon$. Consider the training residual sum of squares (RSS) for the linear regression, and also the training
RSS for the cubic regression. Would we expect one to be lower
than the other, would we expect them to be the same, or is there
not enough information to tell? Justify your answer.

(b) Answer (a) using test rather than training RSS.
The usual answers to this exercise are framed in terms of model flexibility: the cubic polynomial fits the training data more tightly and therefore has a smaller training RSS, but that overfitting of the training data leads to a higher test RSS.
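The train/test RSS claim is easy to check by simulation. Below is a minimal sketch (not from the book) that assumes an invented true relationship $Y = 2 + 3X + \epsilon$ with unit Gaussian noise, fits degree-1 and degree-3 polynomials on one sample of $n = 100$, and evaluates both on a fresh sample. Because the linear model is nested inside the cubic one, the cubic training RSS can never exceed the linear training RSS; the test RSS comparison typically, though not always, goes the other way.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_poly(x, y, degree):
    # Least-squares polynomial fit; returns coefficients, highest power first.
    return np.polyfit(x, y, degree)

def rss(x, y, coeffs):
    # Residual sum of squares of the fitted polynomial on (x, y).
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

def simulate(n):
    # Hypothetical data-generating process: truly linear, Y = 2 + 3X + noise.
    x = rng.uniform(-2, 2, n)
    y = 2 + 3 * x + rng.normal(0, 1, n)
    return x, y

x_train, y_train = simulate(100)
x_test, y_test = simulate(100)

lin = fit_poly(x_train, y_train, 1)
cub = fit_poly(x_train, y_train, 3)

print("train RSS  linear:", rss(x_train, y_train, lin))
print("train RSS  cubic :", rss(x_train, y_train, cub))  # always <= linear (nested models)
print("test  RSS  linear:", rss(x_test, y_test, lin))
print("test  RSS  cubic :", rss(x_test, y_test, cub))
```

Averaging the test RSS gap over many repeated simulations, rather than one seed, gives a cleaner picture of the expected overfitting penalty.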
My question is about the cubic fit on data with an underlying linear relationship. Wouldn't the cubic regression itself reveal that $X^2$ and $X^3$ are unimportant predictors?
I prepared some sample data to prove this for myself:
linear fit on linear data
power   coeff      SE        T-stat      p-value
0       5.011958   0.038305  130.844922  4.09e-112
1       0.299021   0.002319  128.950570  1.69e-111

cubic fit on linear data
power   coeff      SE        T-stat      p-value
0       5.017693   0.043301  115.880040  2.92e-105
1       0.305327   0.007315  41.739973   1.20e-63
2      -0.000642   0.000626  1.026529    0.1536
3       0.000014   0.000014  0.982705    0.1641
Am I missing something in my thinking?
Best Answer
Your argument is correct. Since the true relationship is linear, the squared and cubic terms in the cubic fit are not significant, and that is confirmed by the large p-values for the $X^2$ and $X^3$ coefficients in your output (roughly 0.15 and 0.16).
As @mark999 mentioned, your standard errors are so small that you appear to be using a sample size much larger than $n = 100$. If you follow the original question and use $n = 100$, the numbers will differ, but the conclusion stays the same.