In the Wikipedia article on the bias-variance tradeoff, the independence of the estimator $\hat f(x)$ and the noise term $\epsilon$ is used in a crucial way in the proof of the decomposition of the mean squared error. No justification for this independence is given, and I can't seem to figure it out. For example, if $f(t)=\beta_0 + \beta_1 t$, $Y_i=f(x_i) + \epsilon_i$ ($i=1,\ldots,n$), and $\hat f(x)=\hat\beta_0 + \hat\beta_1 x$ as in simple linear regression, are the $\epsilon_i$ independent of $\hat\beta_0$ and $\hat\beta_1$?
Solved – In linear regression, are the noise terms independent of the coefficient estimators
estimators, independence, linear model, regression
Related Solutions
$E\left(\frac{\sum (x_i - \bar{x})\beta_1 x_i}{S_{xx}}\right) = \frac{\sum (x_i - \bar{x})\beta_1 x_i}{S_{xx}}$ because everything inside is constant. The rest is just algebra. Evidently you need to show $\sum (x_i - \bar{x}) x_i = S_{xx}$. Comparing the two sides with the definition of $S_{xx}$, it suffices to show $\sum(x_i - \bar{x}) \bar{x} = 0$, which follows from the definition of $\bar{x}$, since $\sum(x_i - \bar{x}) = n\bar{x} - n\bar{x} = 0$.
$Var\left(\frac{\sum (x_i - \bar{x})\epsilon_i}{S_{xx}}\right) = \sum \left[\frac{(x_i - \bar{x})^2}{S_{xx}^2}\sigma^2\right]$ by the independence of the $\epsilon_i$. Using the definition of $S_{xx}$, this simplifies to the desired result, $\sigma^2/S_{xx}$.
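As a quick sanity check, here is a minimal numpy sketch (the design points, coefficients, and seed are my own illustrative choices) that verifies the identity $\sum (x_i - \bar{x}) x_i = S_{xx}$ and compares the Monte Carlo variance of $\hat\beta_1 = \sum (x_i - \bar{x})Y_i / S_{xx}$ (the standard slope estimator underlying the calculations above) with $\sigma^2/S_{xx}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, beta0, beta1 = 20, 2.0, 1.0, 3.0
x = rng.uniform(0, 10, size=n)            # fixed design points
Sxx = np.sum((x - x.mean()) ** 2)

# Identity used above: sum((x_i - xbar) * x_i) equals S_xx
print(np.allclose(np.sum((x - x.mean()) * x), Sxx))   # True

# Monte Carlo variance of beta1_hat, holding x fixed across replications
reps = 50_000
eps = rng.normal(0, sigma, size=(reps, n))
Y = beta0 + beta1 * x + eps
b1 = ((x - x.mean()) * Y).sum(axis=1) / Sxx
print(b1.var(), sigma**2 / Sxx)            # approximately equal
```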
This is tantamount to showing that $\widehat{\boldsymbol{\beta}}$, the vector of estimates, is independent of the residual vector $\mathbf{e}$. I trust that you are familiar with the matrix notation of the model; it makes the proof quite short.
Recall that the OLS estimator is given by $\left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}\mathbf{X}^{T}\mathbf{Y}$ and the vector of residuals by $\left(\mathbf{I}-\mathbf{H} \right)\mathbf{Y}$, where $\mathbf{H}$ is the projection matrix given by $\mathbf{X}\left(\mathbf{X}^{T}\mathbf{X} \right)^{-1}\mathbf{X}^{T}$. Assuming that $\mathbf{Y}$ is multivariate normal (the assumption you need in order to construct finite-sample tests and confidence intervals), we want to exploit the fact that linear transformations of multivariate normal vectors are again multivariate normal. Hence we stack the two as follows:
$$\begin{bmatrix} \widehat{\boldsymbol{\beta}} \\ \mathbf{e} \end{bmatrix}=\begin{bmatrix} \left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}\mathbf{X}^{T} \\ \mathbf{I}-\mathbf{H} \end{bmatrix} \mathbf{Y}$$
And now we need to remember that if $\mathbf{Z}\sim N_p \left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \right)$, then $\mathbf{AZ}\sim N \left(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^{T} \right)$ (the distribution is closed under affine transformations). We are mainly interested in the covariance matrix of this stacked vector: if its off-diagonal blocks are zero, then $\widehat{\boldsymbol{\beta}}$ and $\mathbf{e}$ are uncorrelated and, being jointly normal, independent. It is easy to show (I leave the details to you) that the covariance matrix is of the form
$$ \begin{bmatrix} \sigma^2 \left(\mathbf{X}^{T}\mathbf{X} \right)^{-1} & \mathbf{0} \\ \mathbf{0} & \sigma^2 \left(\mathbf{I}-\mathbf{H} \right) \end{bmatrix} $$
and so we may conclude that the random variables are independent. And since $\widehat{\boldsymbol{\beta}}$ is independent of $\mathbf{e}$, it is also independent of the Mean Squared Error $\frac{\mathbf{e}^{T}\mathbf{e}}{n-k}$, as we wanted to show.
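For completeness, here is a sketch of the off-diagonal block computation left to the reader: using $\mathbb{V}(\mathbf{Y}) = \sigma^2 \mathbf{I}$, the symmetry of $\mathbf{H}$, and $\mathbf{X}^{T}\mathbf{H} = \mathbf{X}^{T}$,

$$\mathbb{Cov}\left(\widehat{\boldsymbol{\beta}}, \mathbf{e}\right) = \left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}\mathbf{X}^{T} \cdot \sigma^2 \mathbf{I} \cdot \left(\mathbf{I}-\mathbf{H} \right)^{T} = \sigma^2 \left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}\left(\mathbf{X}^{T} - \mathbf{X}^{T}\mathbf{H}\right) = \mathbf{0}.$$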
Note that in general lack of correlation does not imply independence; this is a special property of the multivariate normal distribution, and undoubtedly one of the reasons it is loved so much by statisticians.
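As a numerical sanity check on the zero blocks above, here is a small simulation sketch (numpy; the design, coefficients, $\sigma$, and seed are arbitrary choices of mine) in which the empirical covariance between a coefficient estimate and a residual comes out essentially zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 30, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design
beta = np.array([1.0, -2.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

reps = 20_000
eps = rng.normal(0, sigma, size=(reps, n))
Y = X @ beta + eps                      # one replication per row
beta_hat = Y @ X @ XtX_inv              # each row is beta_hat (transposed)
e = Y @ (np.eye(n) - H)                 # residuals (I - H is symmetric)

# Empirical covariance between beta_hat_0 and the first residual: close to zero
print(np.cov(beta_hat[:, 0], e[:, 0])[0, 1])
```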
Hope this helps.
Best Answer
No, they're not independent: In multiple linear regression the OLS coefficient estimator can be written as:
$$\begin{equation} \begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{x}^\text{T} \mathbf{x})^{-1} (\mathbf{x}^\text{T} \mathbf{y}) \\[6pt] &= (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} (\mathbf{x} \boldsymbol{\beta} + \boldsymbol{\varepsilon}) \\[6pt] &= \boldsymbol{\beta} + (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} \boldsymbol{\varepsilon}. \\[6pt] \end{aligned} \end{equation}$$
In regression problems we analyse the behaviour of the quantities conditional on the explanatory variables (i.e., conditional on the design matrix $\mathbf{x}$). The covariance between the coefficient estimators and errors is:
$$\begin{equation} \begin{aligned} \mathbb{Cov} ( \hat{\boldsymbol{\beta}}, \boldsymbol{\varepsilon} |\mathbf{x}) &= \mathbb{Cov} \Big( (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} \boldsymbol{\varepsilon}, \boldsymbol{\varepsilon} \Big| \mathbf{x} \Big) \\[6pt] &= (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} \mathbb{Cov} ( \boldsymbol{\varepsilon}, \boldsymbol{\varepsilon} | \mathbf{x} ) \\[6pt] &= (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} \mathbb{V} ( \boldsymbol{\varepsilon} | \mathbf{x} ) \\[6pt] &= \sigma^2 (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} \boldsymbol{I} \\[6pt] &= \sigma^2 (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T}. \\[6pt] \end{aligned} \end{equation}$$
In general, this covariance matrix is a non-zero matrix, and so the coefficient estimators are correlated with the error terms (conditional on the design matrix).
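Here is a short simulation sketch (numpy; the design matrix, $\sigma$, and seed are illustrative assumptions of mine) that estimates $\mathbb{Cov}(\hat{\boldsymbol{\beta}}, \boldsymbol{\varepsilon} \mid \mathbf{x})$ empirically and compares it with $\sigma^2 (\mathbf{x}^\text{T}\mathbf{x})^{-1}\mathbf{x}^\text{T}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 25, 2.0
x = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])   # fixed design matrix
beta = np.array([1.0, 0.5])
A = np.linalg.inv(x.T @ x) @ x.T        # beta_hat = beta + A @ eps

reps = 100_000
eps = rng.normal(0, sigma, size=(reps, n))
beta_hat = beta + eps @ A.T             # one replication per row

# Empirical cross-covariance Cov(beta_hat, eps) vs theoretical sigma^2 * A
emp = (beta_hat - beta_hat.mean(0)).T @ (eps - eps.mean(0)) / (reps - 1)
print(emp[:, 0])                        # first column, estimated
print((sigma**2 * A)[:, 0])             # first column, theoretical
```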
Special case (simple linear regression): In the special case where we have a simple linear regression with an intercept term and a single explanatory variable we have design matrix:
$$\mathbf{x} = \begin{bmatrix} 1 & x_1 \\[6pt] 1 & x_2 \\[6pt] \vdots & \vdots \\[6pt] 1 & x_n \\[6pt] \end{bmatrix},$$
which gives:
$$\begin{equation} \begin{aligned} (\mathbf{x}^\text{T} \mathbf{x})^{-1} \mathbf{x}^\text{T} &= \begin{bmatrix} n & & \sum x_i \\[6pt] \sum x_i & & \sum x_i^2 \\[6pt] \end{bmatrix}^{-1} \begin{bmatrix} 1 & 1 & \cdots & 1 \\[6pt] x_1 & x_2 & \cdots & x_n \\[6pt] \end{bmatrix} \\[6pt] &= \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i^2 & & -\sum x_i \\[6pt] -\sum x_i & & n \\[6pt] \end{bmatrix} \begin{bmatrix} 1 & 1 & \cdots & 1 \\[6pt] x_1 & x_2 & \cdots & x_n \\[6pt] \end{bmatrix} \\[6pt] &= \frac{1}{n \sum x_i^2 - (\sum x_i)^2} \begin{bmatrix} \sum x_i(x_i-x_1) & \cdots & \sum x_i(x_i-x_n) \\[6pt] -\sum (x_i-x_1) & \cdots & -\sum (x_i-x_n) \\[6pt] \end{bmatrix}. \\[6pt] \end{aligned} \end{equation}$$
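As a quick numerical check of this closed form (a sketch; the x values below are arbitrary), one can compare it against a direct computation of $(\mathbf{x}^\text{T}\mathbf{x})^{-1}\mathbf{x}^\text{T}$:

```python
import numpy as np

xv = np.array([1.0, 2.5, 4.0, 7.5, 9.0])     # arbitrary explanatory values
n = len(xv)
X = np.column_stack([np.ones(n), xv])
D = n * np.sum(xv**2) - np.sum(xv)**2         # the common denominator

direct = np.linalg.inv(X.T @ X) @ X.T
closed = np.vstack([
    [np.sum(xv * (xv - xk)) / D for xk in xv],    # first row (intercept weights)
    [-np.sum(xv - xk) / D for xk in xv],          # second row (slope weights)
])
print(np.allclose(direct, closed))                # True
```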
Hence, we have:
$$\begin{equation} \begin{aligned} \mathbb{Cov}(\hat{\beta}_0, \varepsilon_k) &= \sigma^2 \cdot \frac{\sum x_i(x_i-x_k)}{n \sum x_i^2 - (\sum x_i)^2}, \\[10pt] \mathbb{Cov}(\hat{\beta}_1, \varepsilon_k) &= - \sigma^2 \cdot \frac{\sum (x_i-x_k)}{n \sum x_i^2 - (\sum x_i)^2}. \\[10pt] \end{aligned} \end{equation}$$
We can also obtain the correlation, which is perhaps a bit more useful. To do this we note that:
$$\mathbb{V}(\varepsilon_k) = \sigma^2 \quad \quad \quad \mathbb{V}(\hat{\beta}_0) = \frac{\sigma^2 \sum x_i^2}{n \sum x_i^2 - (\sum x_i)^2} \quad \quad \quad \mathbb{V}(\hat{\beta}_1) = \frac{\sigma^2 n}{n \sum x_i^2 - (\sum x_i)^2}.$$
Hence, we have correlation:
$$\begin{equation} \begin{aligned} \mathbb{Corr}(\hat{\beta}_0, \varepsilon_k) &= \frac{\sum x_i(x_i-x_k)}{\sqrt{(\sum x_i^2)(n \sum x_i^2 - (\sum x_i)^2)}}, \\[10pt] \mathbb{Corr}(\hat{\beta}_1, \varepsilon_k) &= - \frac{\sum (x_i-x_k)}{\sqrt{n(n \sum x_i^2 - (\sum x_i)^2)}}. \\[10pt] \end{aligned} \end{equation}$$
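These correlation formulae are easy to confirm by simulation. A minimal sketch (numpy; the x values, coefficients, and seed are illustrative choices of mine) compares the empirical correlation between $\hat\beta_1$ and $\varepsilon_1$ with the closed form above:

```python
import numpy as np

rng = np.random.default_rng(3)
xv = np.array([1.0, 2.5, 4.0, 7.5, 9.0])      # arbitrary explanatory values
n, sigma, beta0, beta1, k = len(xv), 1.0, 2.0, -1.0, 0   # k indexes epsilon_k
D = n * np.sum(xv**2) - np.sum(xv)**2
X = np.column_stack([np.ones(n), xv])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 200_000
eps = rng.normal(0, sigma, size=(reps, n))
Y = beta0 + beta1 * xv + eps
coef = Y @ X @ XtX_inv                         # columns: beta0_hat, beta1_hat

emp = np.corrcoef(coef[:, 1], eps[:, k])[0, 1]
theory = -np.sum(xv - xv[k]) / np.sqrt(n * D)
print(emp, theory)                             # approximately equal (about -0.56 here)
```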