Linear Regression: Correlation between predictors and residuals

Tags: data analysis, linear regression, machine learning, probability, statistics

I am reading Chapter 3 of Elements of Statistical Learning. In the explanation of Forward Stagewise Regression and Least Angle Regression, the authors note that reducing the correlation between the predictors and the residuals amounts to moving toward the standard linear regression fit. I have also read online that in standard linear regression (which minimizes the sum of squared residuals), the predictors are uncorrelated with the residuals. I am not able to prove this. I have tried the following so far:

$r_i = y_i - x_i^T\hat{\beta}$

$x_i r_i = x_i y_i - x_i x_i^T\hat{\beta}$

The above expression does not evaluate to $0$ in general. Ideally, I think I should compute $E[rX]$, but I don't know how to write down a sample (empirical) version of it, since $r$ is a vector and $X$ is a matrix. Could someone please help me out? I am really confused. It would also be helpful if someone could point me to a resource explaining these correlations in more detail. I know the residuals should be approximately Gaussian, since they are an estimate of the noise.

Best Answer

Let $\mathbf{r}=[r_1,r_2,\ldots,r_n]^{\top}$, $\mathbf{y}=[y_1,y_2,\ldots,y_n]^{\top}$, and $\mathbf{X}=[x_1,x_2,\ldots,x_n]^{\top}$, so that $\mathbf{r}=\mathbf{y}-\mathbf{X}\hat{\beta}$ with $\hat{\beta}=(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}$. Then
\begin{align}
\mathbf{r}^{\top}\mathbf{X}&=\big(\mathbf{y}-\mathbf{X}\hat{\beta}\big)^{\top}\mathbf{X} \\
&=\mathbf{y}^{\top}\mathbf{X}-\left(\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y}\right)^{\top}\mathbf{X} \\
&=\mathbf{y}^{\top}\mathbf{X}-\mathbf{y}^{\top}\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{X} \\
&=\mathbf{y}^{\top}\mathbf{X}-\mathbf{y}^{\top}\mathbf{X}=\mathbf{0},
\end{align}
so the residuals are orthogonal to every column of $\mathbf{X}$. If $\mathbf{X}$ includes an intercept column of ones, $\mathbf{1}$, this orthogonality also gives $\mathbf{r}^{\top}\mathbf{1}=\sum_{i=1}^n r_i=0$, i.e. the residuals have mean zero. Zero mean together with orthogonality implies that the sample covariance, and hence the sample correlation, between the residuals and each regressor is $0$.
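
A quick numerical check of this identity may help; below is a minimal sketch with NumPy, where the data, true coefficients, and random seed are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: n observations, p predictors, plus an intercept column.
rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column = intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS fit (least-squares solution of X beta = y) and residuals.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

# r'X is numerically zero for every column, including the ones column,
# so the residuals have mean zero and zero sample correlation with each predictor.
print(X.T @ r)                                                    # ~ [0, 0, 0, 0]
print([np.corrcoef(X[:, j], r)[0, 1] for j in range(1, p + 1)])   # ~ [0, 0, 0]
```

The printed inner products and correlations should all be zero up to floating-point error, matching the algebra above.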