Solved – Proof that the expected MSE is smaller in training than in test

machine learning, regression, training error

This is Exercise 2.9 (p. 40) of the classic book "The Elements of Statistical Learning", second edition, by Hastie, Tibshirani and Friedman. In the book, it's mentioned that this exercise was brought to the authors' attention by Ryan Tibshirani, from a homework assignment by Andrew Ng.

Ex. 2.9. Consider a linear regression model with $p$ parameters, fit by least squares to a set of training data $(x_1,y_1),\dots,(x_N,y_N)$ drawn at random from a population. Let $\hat{\beta}$ be the least squares estimate. Suppose we have some test data $(\tilde{x}_1,\tilde{y}_1),\dots,(\tilde{x}_M,\tilde{y}_M)$ drawn at random from the same population as the training data. If $R_{tr}(\beta)=\frac{1}{N}\sum_{i=1}^N(y_i-\beta^T x_i)^2$ and $R_{te}(\beta)=\frac{1}{M}\sum_{i=1}^M(\tilde{y}_i-\beta^T \tilde{x}_i)^2$, prove that
$$
\mathrm{E}[R_{tr}(\hat{\beta})]\leq\mathrm{E}[R_{te}(\hat{\beta})],
$$

where the expectations are over all that is random in each expression.
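
A quick Monte Carlo check can make the claim concrete before proving it. The sketch below is only illustrative: the Gaussian population, the sample sizes $N=M=20$, the dimension $p=5$, and the noise level are arbitrary assumptions, not part of the exercise. It estimates both expectations by repeatedly drawing fresh training and test sets.

```python
import numpy as np

# Illustrative sketch only: the population, sample sizes, dimension and
# noise level are arbitrary assumptions, not specified by the exercise.
rng = np.random.default_rng(0)

N, M, p = 20, 20, 5          # training size, test size, number of parameters
n_reps = 20_000              # Monte Carlo repetitions
beta_true = rng.normal(size=p)

def draw(n):
    """Draw n IID pairs (x, y) from the same (made-up) population."""
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(size=n)
    return X, y

R_tr, R_te = [], []
for _ in range(n_reps):
    X, y = draw(N)                                      # training data
    Xt, yt = draw(M)                                    # test data
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares fit
    R_tr.append(np.mean((y - X @ beta_hat) ** 2))       # R_tr(beta_hat)
    R_te.append(np.mean((yt - Xt @ beta_hat) ** 2))     # R_te(beta_hat)

print("estimated E[R_tr]:", np.mean(R_tr))
print("estimated E[R_te]:", np.mean(R_te))
```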

Best Answer

This is an interesting problem which makes you think about what is random in your computations. Here is my take.

The least squares estimate $\hat{\beta}$ is the solution of $$ \arg \min_{\beta\in\mathbb{R}^{p}} \sum_{k=1}^N (y_k-\beta^T x_k)^2. $$
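
Just to fix notation (this closed form is not needed for the argument below, and it assumes the $N\times p$ design matrix $\mathbf{X}$ with rows $x_k^T$ has full column rank), the minimizer is characterized by the normal equations:
$$
\mathbf{X}^T\mathbf{X}\,\hat{\beta} = \mathbf{X}^T\mathbf{y}
\quad\Longrightarrow\quad
\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y},
\qquad \mathbf{y}=(y_1,\dots,y_N)^T.
$$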

Hence, if you consider the random training data $(X_1,Y_1),\dots,(X_N,Y_N)$ as IID pairs from some unknown distribution function $F_{X,Y}$, you can view the random vector $\hat{\beta}$ (the least squares estimator) as a functional $\hat{\Psi}[(X_1,Y_1),\dots,(X_N,Y_N)]$ of the training sample, with suitable measurability conditions. Since, for each realization of the sample, $\hat{\beta}$ minimizes the sum of squares over all of $\mathbb{R}^p$, it satisfies $$ \sum_{k=1}^N (Y_k-\hat{\beta}^T X_k)^2 \leq \sum_{k=1}^N (Y_k-\beta^T X_k)^2, \qquad (*) $$ almost surely, for every random vector $\beta$.

The symmetry of the IID assumption yields that $$ \frac{1}{N}\sum_{k=1}^N \mathrm{E}[(Y_k-\beta^T X_k)^2] = \mathrm{E}[(Y_i-\beta^T X_i)^2], $$ for $i=1,\dots,N$, and every random vector $\beta$.

Therefore, dividing $(*)$ by $N$, taking expectations, and applying the identity above to both sides, we have that $$ \mathrm{E}[(Y_i-\hat{\beta}^T X_i)^2] \leq \mathrm{E}[(Y_i-\beta^T X_i)^2], \qquad (*') $$ for $i=1,\dots,N$, and every random vector $\beta$.
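
Written out, that step is the chain
$$
\mathrm{E}[(Y_i-\hat{\beta}^T X_i)^2]
= \frac{1}{N}\sum_{k=1}^N \mathrm{E}[(Y_k-\hat{\beta}^T X_k)^2]
\leq \frac{1}{N}\sum_{k=1}^N \mathrm{E}[(Y_k-\beta^T X_k)^2]
= \mathrm{E}[(Y_i-\beta^T X_i)^2],
$$
where the two equalities use the identity from the previous paragraph and the middle inequality is the expectation of $(*)$ divided by $N$.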

The key point here is that, since $(*')$ holds for every random vector $\beta$, it must hold for the random vector $$ \beta = \frac{1}{||X_i||^2}\left(Y_iX_i - \tilde{Y}_jX_i + X_i\tilde{X}_j^T\hat{\beta} \right), $$ for any choice of $j=1,\dots,M$. Using this $\beta$ in $(*')$, we have that $$ \mathrm{E}[(Y_i-\hat{\beta}^T X_i)^2] \leq \mathrm{E}[(\tilde{Y}_j-\hat{\beta}^T \tilde{X}_j)^2], $$ for every $i=1,\dots,N$, and every $j=1,\dots,M$.
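
To see why this particular $\beta$ does the job, note that $X_i^T X_i = ||X_i||^2$ and that $\tilde{X}_j^T\hat{\beta} = \hat{\beta}^T\tilde{X}_j$ is a scalar, so
$$
\beta^T X_i
= \frac{1}{||X_i||^2}\left(Y_i\,X_i^T X_i - \tilde{Y}_j\,X_i^T X_i + \hat{\beta}^T\tilde{X}_j\,X_i^T X_i\right)
= Y_i - \tilde{Y}_j + \hat{\beta}^T\tilde{X}_j,
$$
hence $Y_i - \beta^T X_i = \tilde{Y}_j - \hat{\beta}^T\tilde{X}_j$ and the two squared errors inside the expectations coincide.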

The last inequality and the IID assumption (used in the same way as before) imply that $$ \frac{1}{N} \sum_{k=1}^N \mathrm{E}[(Y_k-\hat{\beta}^T X_k)^2] \leq \frac{1}{M} \sum_{k=1}^M \mathrm{E}[(\tilde{Y}_k-\hat{\beta}^T \tilde{X}_k)^2], $$ and therefore $$ \mathrm{E}[R_{tr}(\hat{\beta})] \leq \mathrm{E}[R_{te}(\hat{\beta})]. $$
