Linear Regression Prediction Errors

linear regression, machine learning, regression, statistics

Suppose that we perform linear regression on data $\mathbf{X}$ (an $N \times {(D+1)}$ matrix) and targets $\mathbf{y}$ (an $N \times 1$ vector). Let $\mathbf{w}$ (a $(D+1) \times 1$ vector) be the optimal parameters for minimizing the expected loss. Why is it true that for any vector $\mathbf{\tilde{w}}$,
$$\mathbb{E}_{\rho(x_0,y_0)}\left[(y_0-\mathbf{w}^{T}{x_0})(\mathbf{w}-\mathbf{\tilde{w}})^{T} x_0\right]=0?$$
This implies that the prediction error is uncorrelated with any linear function of the input, but why is this true?

So far, I've tried using the fact that $\mathbf{w} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$, but I'm not sure how useful that would be.
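
For concreteness, here is a minimal numerical sketch of the quantity in question, assuming synthetic Gaussian data and the linear-model setup above (all names and data-generating choices are illustrative, not part of the original problem):

```python
import numpy as np

# Minimal sketch of the setup above (synthetic data; names are illustrative).
rng = np.random.default_rng(0)
N, D = 100, 3

X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # N x (D+1), with bias column
v = rng.normal(size=D + 1)                 # hypothetical "true" parameters
y = X @ v + rng.normal(scale=0.5, size=N)  # assumed linear model with noise

# The proposed estimator w = (X^T X)^{-1} X^T y, via the normal equations.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Monte Carlo estimate of E[(y0 - w^T x0) (w - w_tilde)^T x0] over fresh draws.
M = 1_000_000
X0 = np.column_stack([np.ones(M), rng.normal(size=(M, D))])
y0 = X0 @ v + rng.normal(scale=0.5, size=M)
w_tilde = rng.normal(size=D + 1)           # arbitrary comparison vector

print(np.mean((y0 - X0 @ w) * (X0 @ (w - w_tilde))))
```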

Best Answer

This statement is false.

I suppose you assume that $y$ was generated by a linear model, $y = x^T v + \varepsilon$, with $x$ and $\varepsilon$ independent and $E[\varepsilon] = 0$. If so, the statement holds for $w = v$:
$$
E_{\rho(x_0,y_0)}\left[(y_0 - v^T x_0)(v - \tilde{w})^T x_0\right] = E_{\rho(x_0,y_0)}\left[\varepsilon_0 (v - \tilde{w})^T x_0\right]
$$
$$
= E_{\rho(x_0,y_0)}[\varepsilon_0]\,(v - \tilde{w})^T E_{\rho(x_0,y_0)}[x_0] = 0,
$$
where the factorization uses the independence of $\varepsilon_0$ and $x_0$.

However, since your $w$ is an estimator, dependent on random observations, you generally won't have $w = v$. The estimator you propose takes the form
$$
w = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X v + \varepsilon) = v + (X^T X)^{-1} X^T \varepsilon =: v + L\varepsilon.
$$
So, in fact, you have
$$
E_{\rho(x_0,y_0)}\left[(y_0 - w^T x_0)(w - \tilde{w})^T x_0\right] = E_{\rho(x_0,y_0)}\left[(y_0 - (v + L\varepsilon)^T x_0)(v + L\varepsilon - \tilde{w})^T x_0\right].
$$
Exploiting linearity:
$$
= E_{\rho(x_0,y_0)}\left[(y_0 - v^T x_0)(v - \tilde{w})^T x_0\right] + E_{\rho(x_0,y_0)}\left[(y_0 - v^T x_0)(L\varepsilon)^T x_0\right] - E_{\rho(x_0,y_0)}\left[(L\varepsilon)^T x_0 (v - \tilde{w})^T x_0\right] - E_{\rho(x_0,y_0)}\left[(L\varepsilon)^T x_0 (L\varepsilon)^T x_0\right].
$$
We previously saw that the first term is zero. Using the definition of $\varepsilon_0$,
$$
= 0 + E_{\rho(x_0,y_0)}\left[\varepsilon_0 (L\varepsilon)^T x_0\right] - E_{\rho(x_0,y_0)}\left[(L\varepsilon)^T x_0 (v - \tilde{w})^T x_0\right] - E_{\rho(x_0,y_0)}\left[(L\varepsilon)^T x_0\, x_0^T (L\varepsilon)\right].
$$
The second term also vanishes: the expectation is over $(x_0, y_0)$ only, so $L\varepsilon$ is a constant here, and the independence of $\varepsilon_0$ and $x_0$ gives $E[\varepsilon_0 (L\varepsilon)^T x_0] = (L\varepsilon)^T E[\varepsilon_0] E[x_0] = 0$. This can be further simplified to
$$
= -(L\varepsilon)^T E_{\rho(x_0,y_0)}\left[x_0 (v - \tilde{w})^T x_0\right] - (L\varepsilon)^T E_{\rho(x_0,y_0)}\left[x_0 x_0^T\right](L\varepsilon).
$$
If $\Sigma = E_{\rho(x_0,y_0)}[x_0 x_0^T]$ is the second-moment matrix of $x_0$ (its covariance matrix when $x_0$ has mean zero), we have
$$
= -\varepsilon^T L^T E_{\rho(x_0,y_0)}\left[x_0 (v - \tilde{w})^T x_0\right] - \varepsilon^T L^T \Sigma L \varepsilon.
$$
This quantity is random (as it depends on $\varepsilon$), and generally won't be zero. If you want, you can take the expectation with respect to $\varepsilon$, or even $L$. This will eliminate the first term. The second term, however, is a quadratic form, and will not be zero almost surely: it is positive whenever $\Sigma$ is positive definite and $L\varepsilon \neq 0$, so its expectation cannot vanish.
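
A quick Monte Carlo check of both claims (a hedged sketch; the data-generating choices and names below are mine, for illustration): with $w = v$ the expectation is approximately zero, while with the least-squares estimate it generally is not.

```python
import numpy as np

# Numerical check of the conclusion above (illustrative, assumed setup).
rng = np.random.default_rng(1)
N, D, sigma = 50, 3, 0.5

def lhs(w, w_tilde, v, M=2_000_000):
    """Monte Carlo estimate of E[(y0 - w^T x0) (w - w_tilde)^T x0]."""
    X0 = np.column_stack([np.ones(M), rng.normal(size=(M, D))])
    y0 = X0 @ v + rng.normal(scale=sigma, size=M)
    return np.mean((y0 - X0 @ w) * (X0 @ (w - w_tilde)))

v = rng.normal(size=D + 1)        # hypothetical "true" parameters
w_tilde = rng.normal(size=D + 1)  # arbitrary comparison vector

# One training sample and its least-squares estimate, w = v + L @ eps.
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])
eps = rng.normal(scale=sigma, size=N)
y = X @ v + eps
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(lhs(v, w_tilde, v))      # ~0: the identity holds for the true v
print(lhs(w_hat, w_tilde, v))  # generally nonzero for the estimator
```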
