Solved – Comparing RSS from linear and higher power models on training and test data

machine-learning, regression

I'm currently working through the book 'Introduction to Statistical Learning' by James et al. and I'm stuck on one of the exercises, trying to make sense of the results.

The question is this one (quoted below):

  1. I collect a set of data ($n = 100$ observations) containing a single predictor and a quantitative response. I then fit a linear regression
    model to the data, as well as a separate cubic regression, i.e. $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$.

(a) Suppose that the true relationship between X and Y is linear, i.e.
$Y = \beta_0 + \beta_1 X + \varepsilon$. Consider the training residual sum of
squares (RSS) for the linear regression, and also the training RSS for
the cubic regression. Would we expect one to be lower than the other,
would we expect them to be the same, or is there not enough
information to tell? Justify your answer.

Given that the true relationship is linear, I thought adding a cubic term wouldn't improve the fit; it would just use up degrees of freedom and simply increase the RSS. I tried to test this empirically:

training <- function(p, n) {
  # Simulate n points from a truly linear model, then fit on the subset p.
  set.seed(1)
  x <- rnorm(n)
  y <- 5 + 2*x + rnorm(n, sd = 0.5)

  # Note: summary(lm(...))$sigma is the residual standard error (RSE), not the RSS.
  sigma_linear <- summary(lm(y[p] ~ x[p]))$sigma
  sigma_polynomial <- summary(lm(y[p] ~ x[p] + I(x[p]^2) + I(x[p]^3)))$sigma

  c(linear = sigma_linear, polynomial = sigma_polynomial)
}

training(1:20, 100)
 #    linear polynomial 
 # 0.4177607  0.4203941 

If you run this, the linear RSE is smaller than the cubic one, which is what I would expect. But remove the set.seed call from the function and replicate this a few times, and the cubic one will sometimes be lower and sometimes higher. This might be simple random noise, with the cubic terms neither adding to nor subtracting from the model. I don't understand why the cubic model would give a better 'fit'. In fact, in that exact link, both higher-power terms are insignificant and the RSE is higher for the cubic model.
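Note that summary(lm(...))$sigma returns the residual standard error (RSE), which divides by the residual degrees of freedom and so penalizes the extra terms. The raw training RSS of the cubic fit, by contrast, can never exceed that of the nested linear fit. A minimal sketch to check this (same data-generating process as above):

# Compare raw training RSS rather than RSE: for least squares, the
# cubic model's training RSS can never exceed the linear model's,
# because the linear model is the special case beta_2 = beta_3 = 0.
set.seed(1)
x <- rnorm(100)
y <- 5 + 2*x + rnorm(100, sd = 0.5)

rss_linear <- sum(residuals(lm(y ~ x))^2)
rss_cubic  <- sum(residuals(lm(y ~ x + I(x^2) + I(x^3)))^2)

c(linear = rss_linear, cubic = rss_cubic)  # cubic is always <= linear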

(b) Answer (a) using test rather than training RSS.

Now, using the remaining part of the data:

training(20:100, 100)
 #    linear polynomial 
 # 0.4957765  0.4984665

The cubic sigma is still higher. But again, replicate without the set.seed call and it will sometimes be smaller, sometimes larger.
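As an aside, training(20:100, 100) reuses observation 20, which was already in the training subset 1:20; 21:100 would give a clean hold-out set. For a genuine test comparison, fit only on the training subset and score predictions on the held-out points with predict(). A minimal sketch, assuming the same 20/80 split:

# Fit on the first 20 observations, evaluate test RSS on the rest.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)

train <- 1:20
test  <- 21:100   # disjoint from the training indices

fit_linear <- lm(y ~ x, data = d[train, ])
fit_cubic  <- lm(y ~ x + I(x^2) + I(x^3), data = d[train, ])

test_rss <- function(fit) sum((d$y[test] - predict(fit, newdata = d[test, ]))^2)
c(linear = test_rss(fit_linear), cubic = test_rss(fit_cubic))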

Why, then, do people expect the cubic model to have a lower training RSS when the true relationship is linear?

Best Answer

Here is the result of $1,000$ simulations of your basic linear data-generating process, each time with $100$ data points, each time fitted with polynomial regression models of degree $1$ to $5$, recording the corresponding $5$ RSS values for each iteration:

# 1,000 simulated datasets from the linear model y = 5 + 2x + e,
# each fitted with polynomials of degree 1 to 5; store the RSS.
mat <- matrix(0, nrow = 1000, ncol = 5)
for (i in 1:1000) {
  n <- 100
  x <- rnorm(n)
  y <- 5 + 2 * x + rnorm(n, sd = 0.5)   # sd = 0.5, matching the OP's process
  for (j in 1:5) {
    mat[i, j] <- sum(residuals(lm(y ~ poly(x, j, raw = TRUE)))^2)
  }
}

[Figure: boxplots of the training RSS for polynomial degrees 1 to 5 across the 1,000 simulations]

Clearly, even though within each iteration the five models are fitted to the exact same linear dataset, the tendency is a progressive decrease in $\text{RSS}$ with increasing polynomial degree.
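The figure above can be reproduced and summarized directly from mat; a minimal sketch:

# Mean training RSS per polynomial degree across the 1,000 runs;
# the means decrease monotonically with the degree.
colMeans(mat)

# Distribution of RSS by degree, as in the boxplot above.
boxplot(mat, names = 1:5,
        xlab = "polynomial degree", ylab = "training RSS")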

In fact, these differences are statistically significant when running an ANOVA across the different polynomial models (p-value: 6.444e-10).
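The ANOVA call itself is not shown above; one way to run it is sketched below, stacking mat into long format and treating the simulation run as a blocking factor to respect the paired structure (the exact specification used for the quoted p-value is an assumption on my part):

# Long format: one row per (run, degree) pair. as.vector(mat) stacks
# column by column, i.e. all 1,000 runs of degree 1 first, and so on.
df <- data.frame(
  rss    = as.vector(mat),
  degree = factor(rep(1:5, each = 1000)),
  run    = factor(rep(1:1000, times = 5))
)

# Repeated-measures ANOVA: the Error(run) term accounts for the fact
# that all five models are fitted to the same dataset within each run.
summary(aov(rss ~ degree + Error(run), data = df))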

This, I think, answers your question.


BACKGROUND:

In general, the higher the order of the polynomial model, the better the fit to the data points in the training set. You can see a funny example of it in this post. The problem, logically, is overfitting and the consequent out-of-sample error, as the sketch below illustrates.
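A minimal sketch of that trade-off on simulated data (my own example, not from the linked post): training RSS falls monotonically with the degree, while test RSS eventually rises.

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, sd = 0.5)
d <- data.frame(x = x, y = y)
train <- 1:100
test  <- 101:200

# Training RSS can only decrease with the degree; test RSS bottoms
# out near the true (linear) model and then climbs from overfitting.
for (deg in 1:10) {
  fit <- lm(y ~ poly(x, deg), data = d[train, ])
  cat(sprintf("degree %2d: train RSS %6.2f, test RSS %6.2f\n", deg,
              sum(residuals(fit)^2),
              sum((d$y[test] - predict(fit, newdata = d[test, ]))^2)))
}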

Just to frame the answer:

\begin{align} \text{SST}_{\text{otal}} &= \color{red}{\text{SSE}_{\text{xplained}}}+\color{blue}{\text{SSR}_{\text{esidual}}}\\ \end{align} or \begin{align}\small \sum(y_i-\bar y)^2 &=\small \color{red}{\sum(\hat y_i-\bar y)^2}+\color{blue}{\sum(y_i-\hat y_i)^2} \end{align}

In the OP, the R code extracts the Residual Standard Error rather than the residual sum of squares (SSR or RSS), which amounts to:

$$\small \text{RSE} = \sqrt{\frac{\sum(y_i - \hat y_i)^2}{\text{df}}}$$

with $\text{df} = n - p - 1$ for a model with $p$ predictors.

However, it doesn't change what follows.
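Both identities are easy to verify numerically; a quick sketch on simulated data:

set.seed(1)
x <- rnorm(100)
y <- 5 + 2*x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)               # SSR / RSS (residual)
sse <- sum((fitted(fit) - mean(y))^2)      # SSE (explained)
sst <- sum((y - mean(y))^2)                # SST (total)

all.equal(sst, sse + rss)                                  # TRUE: SST = SSE + SSR
all.equal(summary(fit)$sigma, sqrt(rss / fit$df.residual)) # TRUE: sigma is the RSE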


Here is an example with the mtcars dataset, plotting the predicted polynomial curves over the data points of miles per gallon (mpg) versus weight (wt) for different car models, for polynomial degrees from $1$ to $10$:

[Figure: mpg versus wt in mtcars with fitted polynomial curves of degree 1 to 10]

Although the relationship is fairly linear, as can be inferred from general knowledge about cars and from inspecting the data points, as the degree of the polynomial model increases the predicted values fall closer and closer to the actual data points, because the model gains flexibility to accommodate individual variations arising from noise in the data.

Concomitantly, there is a drop in the RSS:

[Figure: training RSS versus polynomial degree for the mtcars fits]

all of it at the expense of overfitting and increased out-of-sample error.
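Figures like these can be reproduced along the following lines (a sketch; the grid of 200 prediction points is an arbitrary choice):

# Overlay fitted polynomial curves of mpg on wt, degree 1 to 10.
plot(mpg ~ wt, data = mtcars)
wgrid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 200))
for (deg in 1:10) {
  fit <- lm(mpg ~ poly(wt, deg), data = mtcars)
  lines(wgrid$wt, predict(fit, newdata = wgrid), col = deg)
}

# Training RSS against polynomial degree: it decreases monotonically.
rss <- sapply(1:10, function(deg)
  sum(residuals(lm(mpg ~ poly(wt, deg), data = mtcars))^2))
plot(1:10, rss, type = "b", xlab = "polynomial degree", ylab = "training RSS")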
