So, I would recommend using the standard method for comparing nested models. In your case, you consider two alternative models, the cubic fit being the more "complex" one. An F- or $\chi^2$-test tells you whether the residual sum of squares or the deviance decreases significantly when you add further terms. It is very much like comparing a model including only the intercept (in which case you have residual variance only) with another one that includes one meaningful predictor: does this added predictor account for a sufficient part of the variance in the response? In your case, it amounts to asking: does modeling a cubic relationship between X and Y decrease the unexplained variance (equivalently, increase the $R^2$), and thus provide a better fit to the data than a linear fit?
Such a test is often used as a test of linearity between the response variable and the predictor, and this is the reason why Frank Harrell advocates the use of restricted cubic splines instead of assuming a strictly linear relationship between Y and the continuous Xs (e.g., age).
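As a quick sketch of that simpler intercept-only vs. one-predictor comparison (using made-up data, not the simulation below), the same F-test is obtained with anova():
set.seed(1)
x0 <- rnorm(100)
y0 <- 1 + 0.5*x0 + rnorm(100)
fit.null <- lm(y0 ~ 1)     # intercept only: residual variance of y0
fit.x <- lm(y0 ~ x0)       # adds one predictor
anova(fit.null, fit.x)     # does x0 account for a significant part of the variance?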
The following example comes from a book I was reading some months ago (High-dimensional data analysis in cancer research, Chap. 3, p. 45), but it may well serve as an illustration. The idea is simply to fit different kinds of models to a simulated data set that clearly exhibits a non-linear relationship between the response variable and the predictor. The true generative model is shown in black (dashed). The other colors are for different models (restricted cubic spline, a B-spline close to yours, and a cross-validated smoothing spline).
library(rms)
library(splines)
set.seed(101)
f <- function(x) sin(sqrt(2*pi*x))  # true (non-linear) mean function
n <- 1000
x <- runif(n, 0, 2*pi)
eps <- rnorm(n, 0, 0.25)  # Gaussian noise (SD = 0.25)
y <- f(x) + eps
plot(x, y, cex=.4)
curve(f, 0, 6, lty=2, add=TRUE)  # true generative model (dashed black)
# linear fit
lm00 <- lm(y~x)
# restricted cubic spline, 3 knots (2 Df)
lm0 <- lm(y~rcs(x,3))
lines(seq(0, 6, length=1000),
      predict(lm0, data.frame(x=seq(0, 6, length=1000))),
      col="red")
# use B-spline and a single knot at x=1.13 (4 Df)
lm1 <- lm(y~bs(x, knots=1.13))
lines(seq(0, 6, length=1000),
      predict(lm1, data.frame(x=seq(0, 6, length=1000))),
      col="green")
# cross-validated smoothed spline (approx. 20 Df)
xy.spl <- smooth.spline(x, y, cv=TRUE)
lines(xy.spl, col="blue")
legend("bottomleft", c("f(x)","RCS {rms}","BS {splines}","SS {stats}"),
col=1:4, lty=c(2,rep(1,3)),bty="n", cex=.6)
Now, if you want to compare the linear fit (lm00) and the model relying on B-splines (lm1), you just have to do an F-test to see that the latter provides a better fit:
> anova(lm00, lm1)
Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ bs(x, knots = 1.13)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)
1    998 309.248
2    995  63.926  3    245.32 1272.8 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
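The same kind of comparison serves as the test of linearity mentioned above: comparing the linear fit to the restricted cubic spline fit (lm0, not shown in the output above) asks whether the non-linear spline component significantly reduces the residual sum of squares.
anova(lm00, lm0)  # F-test of the non-linear (spline) component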
Likewise, it is quite usual to compare a GLM with a GAM based on the results of a $\chi^2$-test.
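For instance, here is a sketch assuming the mgcv package is available (it is not loaded above): with the same simulated data, a Gaussian GLM that is linear in x can be compared to a GAM with a smooth term.
library(mgcv)
fit.glm <- gam(y ~ x)     # equivalent to a Gaussian GLM, linear in x
fit.gam <- gam(y ~ s(x))  # penalized smooth of x
anova(fit.glm, fit.gam, test="Chisq")  # approximate chi-squared comparison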
Best Answer
Your wording seems to imply that the error term exists because we deal with samples, and that it captures information about the non-sampled part of the population. That's not correct. The error term in a regression model represents factors other than the observed variables included in the model as $X$'s (explanatory/independent variables) that affect the dependent variable $Y$. A regression model (e.g., $y = \beta_{0} + \beta_{1}x + \epsilon$) begins by assuming what the relationship between the $X$ and $Y$ variables is in the population, so the error term exists even in the population model. The model you end up estimating with sample data allows you to estimate the parameters of that population model.
So the error term is NOT the difference between observed and predicted values of $Y$. Repeating myself, it represents unobserved factors affecting $Y$.
Once the population regression model is assumed, we proceed to estimating it with randomly sampled data. The estimation/fitting procedure estimates the values of the $\beta$'s, and we can then compute predicted/fitted values of $Y$ based on those estimated $\beta$'s and the observed values of the $X$'s. The estimated regression equation takes the form $y_{i} = \hat\beta_{0} + \hat\beta_{1}x_{i} + \hat\epsilon_{i}$, with the hats denoting estimated values. The $\hat\epsilon_{i}$'s are the residuals in the estimated equation (differences between observed and predicted values of $Y$ for each individual in the sample), while the $\epsilon$'s are the errors in the equation containing the population parameters $\beta_{0}$ and $\beta_{1}$. The errors are not observable, while the residuals are computed from the data.
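A small simulation makes the distinction concrete (a sketch with arbitrary parameter values): the errors are drawn when the data are generated and are never observed, while the residuals are computed from the fitted model and only approximate them.
set.seed(42)
n <- 200
x <- runif(n)
eps <- rnorm(n)          # unobservable errors of the population model
y <- 2 + 3*x + eps       # population model: beta0 = 2, beta1 = 3
fit <- lm(y ~ x)         # estimates beta0-hat and beta1-hat from the sample
head(cbind(error = eps, residual = residuals(fit)))  # close, but not identical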