Solved – Comparing RSS from linear and higher power models on training and test data

machine-learning, regression

I'm currently working through the book 'Introduction to Statistical Learning' by James et al. and I'm stuck on one of the exercises, trying to make sense of the results.

The question is this one (quoted below):

  1. I collect a set of data ($n = 100$ observations) containing a single predictor and a quantitative response. I then fit a linear regression
    model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$.

(a) Suppose that the true relationship between X and Y is linear, i.e.
$Y = \beta_0 + \beta_1 X + \varepsilon$. Consider the training residual sum of
squares (RSS) for the linear regression, and also the training RSS for
the cubic regression. Would we expect one to be lower than the other,
would we expect them to be the same, or is there not enough
information to tell? Justify your answer.

Given that the true relationship is linear, I thought adding a cubic term wouldn't improve the fit; it would just use up degrees of freedom and simply increase the RSS. I tried to test this empirically:

training <- function(p, n) {
  # Simulate n points from a truly linear model, then fit on the subset p.
  set.seed(1)
  x <- rnorm(n)
  y <- 5 + 2*x + rnorm(n, sd = 0.5)

  # Note: summary(lm(...))$sigma is the residual standard error (RSE), not the RSS.
  sigma_linear <- summary(lm(y[p] ~ x[p]))$sigma
  sigma_polynomial <- summary(lm(y[p] ~ x[p] + I(x[p]^2) + I(x[p]^3)))$sigma

  c(linear = sigma_linear, polynomial = sigma_polynomial)
}

training(1:20, 100)
 #    linear polynomial 
 # 0.4177607  0.4203941 

If you run this, the linear RSE is smaller than the cubic one, which is what I would expect. But remove the set.seed call from the function and replicate this a few times, and the cubic one will sometimes be lower and sometimes higher. This might be simple random noise, with the cubic terms neither adding to nor subtracting from the model. I don't understand why the cubic model would give a better 'fit'. In fact, in that exact link, both higher-power terms are insignificant and the RSE is higher for the cubic model.
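Note that summary(lm(...))$sigma returns the residual standard error (RSE), which divides by the residual degrees of freedom and so penalizes the extra terms. The raw training RSS of the cubic fit, by contrast, can never exceed that of the nested linear fit. A minimal sketch to check this (same data-generating process as above):

# Compare raw training RSS rather than RSE: for least squares, the
# cubic model's training RSS can never exceed the linear model's,
# because the linear model is the special case beta_2 = beta_3 = 0.
set.seed(1)
x <- rnorm(100)
y <- 5 + 2*x + rnorm(100, sd = 0.5)

rss_linear <- sum(residuals(lm(y ~ x))^2)
rss_cubic  <- sum(residuals(lm(y ~ x + I(x^2) + I(x^3)))^2)

c(linear = rss_linear, cubic = rss_cubic)  # cubic is always <= linear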

(b) Answer (a) using test rather than training RSS.

Now, using the remaining part of the data:

training(20:100, 100)
 #    linear polynomial 
 # 0.4957765  0.4984665

The cubic sigma is still higher. But again, replicate without the set.seed call and it will sometimes be smaller, sometimes larger.
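As an aside, training(20:100, 100) reuses observation 20, which was already in the training subset 1:20; 21:100 would give a clean hold-out set. For a genuine test comparison, fit only on the training subset and score predictions on the held-out points with predict(). A minimal sketch, assuming the same 20/80 split:

# Fit on the first 20 observations, evaluate test RSS on the rest.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)

train <- 1:20
test  <- 21:100   # disjoint from the training indices

fit_linear <- lm(y ~ x, data = d[train, ])
fit_cubic  <- lm(y ~ x + I(x^2) + I(x^3), data = d[train, ])

test_rss <- function(fit) sum((d$y[test] - predict(fit, newdata = d[test, ]))^2)
c(linear = test_rss(fit_linear), cubic = test_rss(fit_cubic))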

Why, then, do people expect the cubic model to have a lower training RSS when the true relationship is linear?

Best Answer

Here is the result of $1,000$ simulations of your basic linear data-generating process, each time with $100$ data points, each time fitted with polynomial regression models of degree $1$ to $5$, recording the corresponding $5$ RSS values for each iteration:

# 1,000 simulated datasets from the linear model y = 5 + 2x + e,
# each fitted with polynomials of degree 1 to 5; store the RSS.
mat <- matrix(0, nrow = 1000, ncol = 5)
for (i in 1:1000) {
  n <- 100
  x <- rnorm(n)
  y <- 5 + 2 * x + rnorm(n, sd = 0.5)   # sd = 0.5, matching the OP's process
  for (j in 1:5) {
    mat[i, j] <- sum(residuals(lm(y ~ poly(x, j, raw = TRUE)))^2)
  }
}

[Figure: boxplots of the training RSS for polynomial degrees 1 to 5 across the 1,000 simulations]

Clearly, even though within each iteration the five models are fitted to the exact same linear dataset, the tendency is a progressive decrease in $\text{RSS}$ with increasing polynomial degree.
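The figure above can be reproduced and summarized directly from mat; a minimal sketch:

# Mean training RSS per polynomial degree across the 1,000 runs;
# the means decrease monotonically with the degree.
colMeans(mat)

# Distribution of RSS by degree, as in the boxplot above.
boxplot(mat, names = 1:5,
        xlab = "polynomial degree", ylab = "training RSS")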

In fact, these differences are statistically significant when running an ANOVA across the different polynomial models (p-value: 6.444e-10).
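The ANOVA call itself is not shown above; one way to run it is sketched below, stacking mat into long format and treating the simulation run as a blocking factor to respect the paired structure (the exact specification used for the quoted p-value is an assumption on my part):

# Long format: one row per (run, degree) pair. as.vector(mat) stacks
# column by column, i.e. all 1,000 runs of degree 1 first, and so on.
df <- data.frame(
  rss    = as.vector(mat),
  degree = factor(rep(1:5, each = 1000)),
  run    = factor(rep(1:1000, times = 5))
)

# Repeated-measures ANOVA: the Error(run) term accounts for the fact
# that all five models are fitted to the same dataset within each run.
summary(aov(rss ~ degree + Error(run), data = df))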

This, I think, answers your question.


BACKGROUND:

In general, the higher the order of the polynomial model, the better the fit to the data points in the training set. You can see a funny example of it in this post. The problem, logically, is overfitting and the consequent out-of-sample error, as the sketch below illustrates.
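A minimal sketch of that trade-off on simulated data (my own example, not from the linked post): training RSS falls monotonically with the degree, while test RSS eventually rises.

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)
train <- 1:100
test  <- 101:200

# Training RSS can only decrease with the degree; test RSS bottoms
# out near the true (linear) model and then climbs from overfitting.
for (deg in 1:10) {
  fit <- lm(y ~ poly(x, deg), data = d[train, ])
  cat(sprintf("degree %2d: train RSS %6.2f, test RSS %6.2f\n", deg,
              sum(residuals(fit)^2),
              sum((d$y[test] - predict(fit, newdata = d[test, ]))^2)))
}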

Just to frame the answer:

\begin{align} \text{SST}_{\text{otal}} &= \color{red}{\text{SSE}_{\text{xplained}}}+\color{blue}{\text{SSR}_{\text{esidual}}}\\ \end{align} or \begin{align}\small \sum(y_i-\bar y)^2 &=\small \color{red}{\sum(\hat y_i-\bar y)^2}+\color{blue}{\sum(y_i-\hat y_i)^2} \end{align}

In the OP, the R code extracts the Residual Standard Error rather than the residual sum of squares (SSR or RSS), which amounts to:

$$\small \text{RSE} = \sqrt{\frac{\sum(y_i - \hat y_i)^2}{\text{df}}}$$

with $\text{df} = n - p - 1$ for a model with $p$ predictors.

However, it doesn't change what follows.
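Both identities are easy to verify numerically; a quick sketch on simulated data:

set.seed(1)
x <- rnorm(100)
y <- 5 + 2*x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)               # SSR / RSS (residual)
sse <- sum((fitted(fit) - mean(y))^2)      # SSE (explained)
sst <- sum((y - mean(y))^2)                # SST (total)

all.equal(sst, sse + rss)                                  # TRUE: SST = SSE + SSR
all.equal(summary(fit)$sigma, sqrt(rss / fit$df.residual)) # TRUE: sigma is the RSE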


Here is an example with the mtcars dataset, plotting the predicted polynomial curves over the data points of miles per gallon (mpg) versus weight (wt) for different car models, for polynomial degrees from $1$ to $10$:

[Figure: mpg versus wt in mtcars with fitted polynomial curves of degree 1 to 10]

Although the relationship is fairly linear, as can be inferred from general knowledge about cars and from inspecting the data points, as the degree of the polynomial model increases the predicted values fall closer and closer to the actual data points, because the model gains flexibility to accommodate individual variations arising from noise in the data.

Concomitantly, there is a drop in the RSS:

[Figure: training RSS versus polynomial degree for the mtcars fits]

all of it at the expense of overfitting and increased out-of-sample error.
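Figures like these can be reproduced along the following lines (a sketch; the grid of 200 prediction points is an arbitrary choice):

# Overlay fitted polynomial curves of mpg on wt, degree 1 to 10.
plot(mpg ~ wt, data = mtcars)
wgrid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 200))
for (deg in 1:10) {
  fit <- lm(mpg ~ poly(wt, deg), data = mtcars)
  lines(wgrid$wt, predict(fit, newdata = wgrid), col = deg)
}

# Training RSS against polynomial degree: it decreases monotonically.
rss <- sapply(1:10, function(deg)
  sum(residuals(lm(mpg ~ poly(wt, deg), data = mtcars))^2))
plot(1:10, rss, type = "b", xlab = "polynomial degree", ylab = "training RSS")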
