It shows a decreasing relationship between the subset size $k$ and the mean squared error (MSE) between the true parameters $\beta$ and the estimates $\hat{\beta}(k)$.
The plot shows the results of alternative subset selection methods. The image caption explains the experimental design: 10 elements of $\beta$ are nonzero, and the remaining 21 are zero. The ideal subset selection method will correctly report which elements of $\beta$ are nonzero and which are zero; in other words, no features are incorrectly included, and no features are incorrectly excluded.
Omitted variable bias occurs when one or more features in the data generating process are omitted from the model. Biased parameter estimates have expected values that do not equal their true values (this is the definition of bias), so the choice to plot $\mathbb{E}\|\beta -\hat{\beta}(k)\|^2$ makes sense. (Note that the definition of bias does not exactly coincide with this experimental setting, because $\beta$ is itself random.) In other words, the plot shows how incorrect the estimates are for various $k$ under the various subset selection methods. When $k$ is too small (in this case, when $k<10$), the parameter estimates are biased, which is why the graph shows large values of $\mathbb{E}\|\beta -\hat{\beta}(k)\|^2$ for small $k$.
Clearly, this shouldn't be the case - adding more variables to a linear model doesn't imply better estimates of the true parameters.
Fortunately, that's not what the plot shows. Instead, the plot shows that employing subset selection methods can produce correct or incorrect results depending on the choice of $k$.
However, this plot does show a special case in which adding additional features does improve the parameter estimates. If one builds a model that omits variables belonging to the data generating process, then the model which includes those variables will achieve a lower estimation error of the parameters, because omitted variable bias is no longer present.
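The effect described above can be sketched with a small simulation. This is not the book's exact best-subset experiment: here I simply fit nested OLS models on the first $k$ columns, and the sample size, coefficient scale, and replication count are all made-up choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_nonzero, reps = 200, 31, 10, 50  # illustrative sizes; 10 of 31 coefficients are nonzero

def mean_sq_param_error(k):
    """Average ||beta - beta_hat(k)||^2 over replications, fitting only the first k features."""
    errs = []
    for _ in range(reps):
        beta = np.zeros(p)
        beta[:n_nonzero] = rng.normal(0, 2, size=n_nonzero)  # first 10 coefficients are nonzero
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(size=n)
        beta_hat = np.zeros(p)
        # OLS on the first k features; excluded coefficients are implicitly estimated as zero
        beta_hat[:k], *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        errs.append(np.sum((beta - beta_hat) ** 2))
    return float(np.mean(errs))

# The error is large while true signal variables are left out (k < 10)
# and drops sharply once all 10 are included.
```

In this sketch the drop at $k=10$ comes from no longer forcing nonzero coefficients to zero, which mirrors the qualitative shape of the curve under discussion.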
What adding more variables does imply is a lower training error, i.e., a lower residual sum of squares.
You're confusing the demonstration in this passage with an alternative which does not employ subset selection. In general, estimating a regression with a larger basis decreases the residual error as measured using the training data; that's not what's happening here.
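The point about training error can be verified directly: for nested OLS fits, adding a column can never increase the residual sum of squares on the training data, even when the response is pure noise. A minimal sketch with made-up synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)  # pure noise: training RSS still shrinks as columns are added

def training_rss(k):
    """Residual sum of squares of an OLS fit on the first k columns."""
    coef, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
    resid = y - X[:, :k] @ coef
    return float(resid @ resid)

rss = [training_rss(k) for k in range(1, p + 1)]
# Each added column projects y onto a larger subspace, so RSS is non-increasing in k.
```

This is exactly the quantity the plot is *not* showing: the figure's $y$-axis measures parameter error, not training RSS.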
Is the $y$-axis labelled incorrectly? In particular, is it possible that the $y$ axis shows Residual Sum of Squares instead of $\mathbb{E}\|\beta -\hat{\beta}(k) \|^2$?
I don't think so; the line of reasoning posited in the original post does not itself establish that the label is incorrect. Sextus' experiments find a similar pattern; it's not identical, but the shape of the curve is similar enough.
As an aside, I think that since this plot displays empirical results from an experiment, it would be clearer to write out the estimator used for the expectation, per Cagdas Ozgenc's suggestion.
Is Figure 3.6 in ESL correct?
The only definitive way to answer this question is to obtain the code used to generate the graph, but the code is not publicly available or distributed by the authors.
Without access to the code used in the procedure, it's always possible that there was some mistake in labeling the graph, or in the scale/location of the data or coefficients. The fact that Sextus has had problems recreating the graph using the procedure described in the caption provides some circumstantial evidence that the caption might not be completely accurate. One might argue that these reproducibility problems support the hypothesis that the labels themselves, or the graphed points, are incorrect. On the other hand, it's possible that the description is incorrect while the label itself is nonetheless correct.
A different edition of the book publishes a different image. But the existence of a different image does not imply that either one is correct.
Best Answer
With nested models, it's possible to conceive of a saturated model, which provides a theoretical upper bound on the likelihood. When you have this, you know that twice the log likelihood ratio is asymptotically a $\chi^2_{p-q}$ random variable, and this asymptotic result is what supports formal inference. For normally distributed errors the exact test is based on the $F$ distribution, and the $F$ statistic, scaled by its numerator degrees of freedom, converges to a $\chi^2$ distribution as the denominator degrees of freedom goes to infinity.
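That last convergence claim can be checked by Monte Carlo, building $F$ variates from their definition as a ratio of scaled chi-square variables. The degrees of freedom and sample counts below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, n = 3, 100_000, 200_000  # large denominator df; illustrative values

# F(d1, d2) by definition: (chi2_d1 / d1) / (chi2_d2 / d2)
f_samples = (rng.chisquare(d1, n) / d1) / (rng.chisquare(d2, n) / d2)
scaled = d1 * f_samples              # d1 * F(d1, d2), the scaled statistic
chi2_samples = rng.chisquare(d1, n)  # reference chi-square with d1 df

# For large d2, the scaled-F draws and the chi-square draws should have
# nearly identical moments (both means are close to d1).
print(scaled.mean(), chi2_samples.mean())
```

For small denominator degrees of freedom the two distributions differ noticeably, which is why the exact $F$ test matters in finite samples.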
With non-nested models, it's possible to have arbitrarily high and low likelihoods, and there are no guarantees about the large-sample behavior of the statistic. This means some scenarios give you very high false negative probabilities, very high false positive probabilities, or both, with no way of calibrating the test.
You can qualitatively compare non-nested models using the AIC or BIC, but you cannot make formal inference about their relative merits; you can only say, "Model Y had a higher/lower IC than Model X."
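A minimal sketch of such a qualitative comparison, assuming Gaussian errors so the AIC reduces (up to an additive constant) to $n\log(\mathrm{RSS}/n) + 2k$. The two candidate models and the data generating process here are made up for illustration; they are non-nested because neither feature set contains the other.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)  # made-up truth: linear in x

def gaussian_aic(design, y):
    """AIC (up to an additive constant) for an OLS fit with Gaussian errors."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1] + 1  # +1 parameter for the error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

model_x = np.column_stack([np.ones(n), x])          # Model X: intercept + x
model_y = np.column_stack([np.ones(n), np.exp(x)])  # Model Y: intercept + exp(x), non-nested

# All we can report is which IC is lower; there is no p-value for this comparison.
```

The comparison yields an ordering of the two models, but, as stated above, no formal test statistic with a known null distribution.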