Multiple Regression – Is Fig 3.6 in Elements of Statistical Learning Correct?

feature selection, lasso, multiple regression, stepwise regression

Here is the figure from the textbook:

[Figure 3.6 from ESL: estimation error versus subset size $k$ for several subset-selection procedures]

It shows a decreasing relationship between subset size $k$ and the mean squared error (MSE) between the true parameters $\beta$ and the estimates $\hat{\beta}(k)$. Clearly, this shouldn't be the case: adding more variables to a linear model doesn't imply better estimates of the true parameters. What adding more variables does imply is a lower training error, i.e. a lower residual sum of squares.

Is the $y$-axis labelled incorrectly? In particular, is it possible that the $y$-axis shows e.g. the residual sum of squares instead of $\mathbb{E}\|\hat{\beta}(k) - \beta\|^2$?

EDIT:

Discussions and multiple attempts to reproduce the figure revealed that the axis is likely labelled correctly. In particular, it cannot be the RSS, since that would be on a completely different scale.

The title question still remains: is Figure 3.6 in ESL correct? My intuition is that the MSE should be lowest around the optimal $k$ (@SextusEmpiricus's answer suggests that is the case, but there the correlation between features is lower). Eyeballing Fig 3.6, we see the MSE continues to decrease beyond $k=10$.

In particular, I'm expecting to see curves similar to those in Figure 3.16:
[Figure 3.16 from ESL: estimation error for several shrinkage and selection procedures]

Figure 3.16 does show additional procedures and uses a different $x$-axis; it also uses a different number of samples (300 vs 100). What is relevant here is the shape of, e.g., the "Forward stepwise" curve (common to both charts – orange in the first, black in the second), which exhibits quite different behaviour across the two figures.

Final Edit

Here you can find my attempt at replicating Fig 3.6; the plot shows different levels of correlation and different numbers of non-zero parameters. Source code here.

Best Answer

It shows a decreasing relationship between subset size $k$ and mean squared error (MSE) of the true parameters, $\beta$ and the estimates $\hat{\beta}(k)$.

The plot shows the results of alternative subset selection methods. The image caption explains the experimental design: there are 10 elements of $\beta$ which are nonzero. The remaining 21 elements are zero. The ideal subset selection method will correctly report which $\beta$ are nonzero and which $\beta$ are zero; in other words, no features are incorrectly included, and no features are incorrectly excluded.

Omitted variable bias occurs when one or more features in the data-generating process is omitted from the model. Biased parameter estimates have expected values which do not equal their true values (this is the definition of bias), so the choice to plot $\mathbb{E}\|\beta -\hat{\beta}(k) \|^2$ makes sense. (Note that the definition of bias does not exactly coincide with this experimental setting, because $\beta$ is also random.) In other words, the plot shows how incorrect the estimates are for various $k$ under the various subset-selection methods. When $k$ is too small (in this case, when $k<10$), the parameter estimates are biased, which is why the graph shows large values of $\mathbb{E}\|\beta -\hat{\beta}(k) \|^2$ for small $k$.
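This experimental design can be sketched in a small simulation. The parameters below (sample size, equicorrelation level, noise scale, number of replicates) are assumptions for illustration, loosely following the figure caption rather than reproducing the authors' exact setup, and greedy forward stepwise selection stands in for the full set of procedures shown in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameters (for illustration, not the book's exact values):
# n observations, p features, the first n_nonzero coefficients nonzero.
n, p, n_nonzero, rho, sigma = 300, 31, 10, 0.85, 6.25
n_reps = 10

def forward_stepwise_path(X, y):
    """Greedy forward selection; returns the OLS estimate for each subset size k=1..p."""
    p = X.shape[1]
    active, residual, path = [], y.copy(), []
    for _ in range(p):
        remaining = [j for j in range(p) if j not in active]
        # add the feature most correlated with the current residual
        scores = [abs(X[:, j] @ residual) for j in remaining]
        active.append(remaining[int(np.argmax(scores))])
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ coef
        beta_hat = np.zeros(p)
        beta_hat[active] = coef
        path.append(beta_hat)
    return path

# Equicorrelated Gaussian features with pairwise correlation rho
cov = rho * np.ones((p, p)) + (1 - rho) * np.eye(p)

mse = np.zeros(p)  # Monte Carlo estimate of E||beta - beta_hat(k)||^2 per k
for _ in range(n_reps):
    beta = np.zeros(p)
    beta[:n_nonzero] = rng.normal(0, np.sqrt(0.4), n_nonzero)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta + rng.normal(0, sigma, n)
    for k, beta_hat in enumerate(forward_stepwise_path(X, y)):
        mse[k] += np.sum((beta - beta_hat) ** 2)
mse /= n_reps
```

Plotting `mse` against $k$ gives a curve of the same kind as the figure; the exact shape depends heavily on the correlation, noise level, and number of replicates, which is what the reproduction attempts discussed here found.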

Clearly, this shouldn't be the case - adding more variables to a linear model doesn't imply better estimates of the true parameters.

Fortunately, that's not what the plot shows. Instead, the plot shows that employing subset selection methods can produce correct or incorrect results depending on the choice of $k$.

However, this plot does show a special case when adding additional features does improve the parameter estimates. If one builds a model that exhibits omitted variable bias, then the model which includes those variables will achieve a lower estimation error of the parameters because omitted variable bias is not present.

What adding more variables does imply is a lower training error, i.e. lower residual sum of squares.

You're confusing the demonstration in this passage with an alternative which does not employ subset selection. In general, estimating a regression with a larger basis decreases the residual error as measured using the training data; that's not what's happening here.
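That nested-model fact is easy to verify directly: for ordinary least squares, adding a column can never increase the training RSS, even when the response is pure noise. A minimal sketch (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # pure noise: training RSS still shrinks as columns are added

def rss(X, y):
    """Training residual sum of squares of an OLS fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ coef) ** 2))

# Fit nested models on the first k columns; RSS is non-increasing in k
rss_path = [rss(X[:, :k], y) for k in range(1, p + 1)]
```

This monotonicity is a property of the training error only; it says nothing about $\mathbb{E}\|\beta -\hat{\beta}(k) \|^2$, which is what Figure 3.6 plots.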

Is the $y$-axis labelled incorrectly? In particular, is it possible that the $y$ axis shows Residual Sum of Squares instead of $\mathbb{E}\|\beta -\hat{\beta}(k) \|^2$?

I don't think so; the line of reasoning posited in the original post does not itself establish that the label is incorrect. Sextus' experiments find a similar pattern; it's not identical, but the shape of the curve is similar enough.

As an aside, I think that since this plot displays empirical results from an experiment, it would be clearer to write out the estimator used for the expectation, per Cagdas Ozgenc's suggestion.
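For instance, with $N$ simulation replicates, the natural plug-in estimator of the quantity on the $y$-axis would be (notation here is my own, not the book's):

$$\widehat{\mathrm{MSE}}(k) \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl\|\beta^{(i)} - \hat{\beta}^{(i)}(k)\bigr\|^{2},$$

where $\beta^{(i)}$ and $\hat{\beta}^{(i)}(k)$ are the true and estimated coefficient vectors in the $i$-th replicate.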

Is Figure 3.6 in ESL correct?

The only definitive way to answer this question is to obtain the code used to generate the graph. The code is not publicly available or distributed by the authors.

Without access to the code used in the procedure, it's always possible that there was some mistake in labelling the graph, or in the scale/location of the data or coefficients. The fact that Sextus has had problems recreating the graph using the procedure described in the caption provides some circumstantial evidence that the caption might not be completely accurate. One might argue that these reproducibility problems support the hypothesis that the labels or the plotted points are incorrect. On the other hand, it's also possible that the description is incorrect but the label itself is correct nonetheless.

A different edition of the book publishes a different image. But the existence of a different image does not imply that either one is correct.
