How to derive $\hat{y}_0=x_0^T\beta+\sum^N_{i=1}l_i(x_0)\epsilon_i$

regression

I am currently reading "The Elements of Statistical Learning" (2nd edition) and I am not quite sure about one thing in Section 2.5, on pp. 24 and 26.

So at the end of p.24 they write the following:

Suppose that we know that the relationship between $Y$ and $X$ is linear,

$$Y=X^T\beta + \epsilon$$

where $\epsilon \sim N(0,\sigma^2)$ and we fit the model by least squares to the training data. For an arbitrary test point $x_0$, we have $\hat{y}_0=x_0^T\hat{\beta}$, which can be written as $\hat{y}_0=x_0^T\beta+\sum^N_{i=1}l_i(x_0)\epsilon_i$, where $l_i(x_0)$ is the $i$th element of $X(X^TX)^{-1}x_0$.

I'm struggling to see how they arrived at this equation. I know that $\hat{\beta}=(X^TX)^{-1}X^TY$, and so $x_0^T\hat{\beta}=x_0^T(X^TX)^{-1}X^TY$.

Since $Y=X^T\beta+\epsilon$, this gives $x_0^T\hat{\beta}=x_0^T(X^TX)^{-1}X^TY=x_0^T(X^TX)^{-1}X(X^T\beta+\epsilon)$.

Well, I'm not sure exactly how to arrive at their equation. Can someone show me how?
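For concreteness, here is a tiny numerical sketch of the setup as I understand it (the sizes, seed, and variable names below are just my own illustrative choices, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))          # N x p design matrix
beta = np.array([1.0, -2.0, 0.5])    # "true" coefficients
eps = rng.normal(scale=0.3, size=N)  # epsilon ~ N(0, sigma^2)
y = X @ beta + eps                   # training responses

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
x0 = rng.normal(size=p)                       # an arbitrary test point
y0_hat = x0 @ beta_hat                        # x_0^T beta_hat
```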

Best Answer

The model formulation $\mathbf{Y} = \mathbf{X}^\text{T} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$ is an unusual framing of the regression model (usually $\mathbf{X}$ would be the design matrix but here it is the transpose of the design matrix). With this formulation you have the OLS estimator $\boldsymbol{\hat{\beta}} = (\mathbf{X} \mathbf{X}^\text{T})^{-1} (\mathbf{X} \mathbf{Y})$. You have:

$$\begin{align} \hat{\mathbf{Y}} &= \mathbf{X}^\text{T} \boldsymbol{\hat{\beta}} \\[6pt] &= \mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} (\mathbf{X} \mathbf{Y}) \\[6pt] &= \mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X}[\mathbf{X}^\text{T} \boldsymbol{\beta} + \boldsymbol{\varepsilon}] \\[6pt] &= \mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X}\mathbf{X}^\text{T} \boldsymbol{\beta} + \mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X} \boldsymbol{\varepsilon} \\[6pt] &= \mathbf{X}^\text{T} \boldsymbol{\beta} + [\mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X}] \boldsymbol{\varepsilon} \\[6pt] &= \mathbf{X}^\text{T} \boldsymbol{\beta} + \mathbf{h} \boldsymbol{\varepsilon}, \\[6pt] \end{align}$$
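To see that this decomposition holds numerically, here is a minimal sketch using the transposed framing above, where $\mathbf{X}$ is $p \times n$ (the dimensions, seed, and variable names are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 40
X = rng.normal(size=(p, n))                 # p x n, transpose of the usual design matrix
beta = rng.normal(size=p)
eps = rng.normal(scale=0.5, size=n)
Y = X.T @ beta + eps                        # Y = X^T beta + eps

H = X.T @ np.linalg.solve(X @ X.T, X)       # hat matrix X^T (X X^T)^{-1} X
Y_hat_ols = X.T @ np.linalg.solve(X @ X.T, X @ Y)  # X^T beta_hat
Y_hat_decomp = X.T @ beta + H @ eps         # X^T beta + h eps

print(np.allclose(Y_hat_ols, Y_hat_decomp))  # True
```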

where $\mathbf{h}$ is the hat matrix. So you have:

$$\begin{align} \hat{Y}_i = \mathbf{X}_i^\text{T} \boldsymbol{\beta} + [\mathbf{h} \boldsymbol{\varepsilon}]_i = \mathbf{X}_i^\text{T} \boldsymbol{\beta} + \sum_{j = 1}^n h_{i,j} \varepsilon_j. \\[6pt] \end{align}$$

where $h_{i,j} = [\mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X}]_{i, j} = [\mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X}_i]_{j}$. The book uses slightly different notation (the correspondence is $\ell_j(\mathbf{X}_i) = [\mathbf{X}^\text{T} (\mathbf{X} \mathbf{X}^\text{T})^{-1} \mathbf{X}_i]_j$), but it is the same result.
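Translating back to the book's notation, where $\mathbf{X}$ is the usual $N \times p$ design matrix, the weight vector is $l(x_0) = \mathbf{X}(\mathbf{X}^\text{T}\mathbf{X})^{-1}x_0$, and the identity $\hat{y}_0 = x_0^T\beta + \sum_{i=1}^N l_i(x_0)\epsilon_i$ can be checked numerically (again, all sizes, seeds, and names below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 60, 4
X = rng.normal(size=(N, p))                  # N x p design matrix, as in the book
beta = rng.normal(size=p)
eps = rng.normal(scale=0.2, size=N)
y = X @ beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
x0 = rng.normal(size=p)

l_x0 = X @ np.linalg.solve(X.T @ X, x0)      # i-th element is l_i(x_0)
lhs = x0 @ beta_hat                          # y0_hat = x_0^T beta_hat
rhs = x0 @ beta + l_x0 @ eps                 # x_0^T beta + sum_i l_i(x_0) eps_i

print(np.allclose(lhs, rhs))                 # True
```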
