[Math] How to find the closed form formula for $\hat{\beta}$ while using ordinary least squares estimation

calculus, linear-algebra, regression, statistics

According to Wikipedia's article on Linear Regression:

Given a data set $\{y_i,x_{i1},\ldots,x_{ip}\}_{i=1}^{n}$ of $n$
statistical units, a linear regression model assumes that the relationship
between the dependent variable $y_i$ and the $p$-vector of
regressors $x_i$ is linear. This relationship is modelled through a
disturbance term or error variable $\varepsilon_i$, an unobserved random
variable that adds random noise to the linear relationship
between the dependent variable and the regressors. The model takes the
form
$$y_i=\beta_0\cdot 1+\beta_1x_{i1}+\cdots+\beta_px_{ip}+\varepsilon_i=x_i^T \beta + \varepsilon_i$$

These $n$ equations can be written in vector form as $$\mathbf{y}=\mathbf{X}\boldsymbol\beta+\boldsymbol\varepsilon$$

For ordinary least squares (OLS) estimation, the article gives the following closed-form expression for the estimate of the unknown parameter vector $\beta$:

$$\hat{\boldsymbol\beta}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}$$

I'm not sure how they arrive at this formula for $\hat{\beta}$. It would be very nice if someone could explain the derivation to me.

Best Answer

I'm going to show this using partial differentiation.

Consider the assumed linear model $$y_i = \mathbf{x}_i^{T}\boldsymbol\beta + \epsilon_i$$ where $y_i, \epsilon_i \in \mathbb{R}$ and $\mathbf{x}_i=\begin{bmatrix} x_{i0} \\ x_{i1} \\ \vdots \\ x_{ip} \end{bmatrix}, \boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \in \mathbb{R}^{p+1}$ for $i = 1, \dots, n$, with $x_{i0} = 1$.
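For concreteness, the case $p = 1$ is just simple linear regression: $\mathbf{x}_i = \begin{bmatrix} 1 \\ x_{i1} \end{bmatrix}$, $\boldsymbol\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$, and the model reads $y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i$.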

Our aim is to solve for $\hat{\boldsymbol\beta}$ by minimizing the residual sum of squares $$\text{RSS}(\boldsymbol\beta) = \sum_{i=1}^{n}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)^2\text{.}$$ To compute this sum, consider the vector of residuals $$\mathbf{e}=\begin{bmatrix} y_1 - \mathbf{x}_1^{T}\boldsymbol\beta \\ y_2 - \mathbf{x}_2^{T}\boldsymbol\beta \\ \vdots \\ y_n - \mathbf{x}_n^{T}\boldsymbol\beta \end{bmatrix}\text{.}$$ Then $\text{RSS}(\boldsymbol\beta) = \mathbf{e}^{T}\mathbf{e}$. Our next step is to find the partial derivatives of $\text{RSS}(\boldsymbol\beta)$ with respect to each $\beta_k$.
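(As a side remark, not needed for the component-wise argument that follows: writing $\mathbf{e} = \mathbf{y} - \mathbf{X}\boldsymbol\beta$, where $\mathbf{X}$ is the design matrix whose $i$-th row is $\mathbf{x}_i^{T}$ and which is defined formally at the end of this answer, gives the compact quadratic form $$\text{RSS}(\boldsymbol\beta) = (\mathbf{y}-\mathbf{X}\boldsymbol\beta)^{T}(\mathbf{y}-\mathbf{X}\boldsymbol\beta) = \mathbf{y}^{T}\mathbf{y} - 2\boldsymbol\beta^{T}\mathbf{X}^{T}\mathbf{y} + \boldsymbol\beta^{T}\mathbf{X}^{T}\mathbf{X}\boldsymbol\beta\text{,}$$ and differentiating this directly yields the same normal equations obtained component-wise below.)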

To compute these partial derivatives, note that for $k = 0, 1, \dots, p$, $$\dfrac{\partial \text{RSS}}{\partial \beta_k}=\dfrac{\partial}{\partial\beta_k}\left\{\sum_{i=1}^{n}\left[y_i- \sum_{j=0}^{p}\beta_jx_{ij}\right]^2 \right\}=-2\sum_{i=1}^{n}x_{ik}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right)\text{.}$$ "Stacking" these, we obtain $$\begin{align} \dfrac{\partial \text{RSS}}{\partial \boldsymbol\beta}&=\begin{bmatrix} \dfrac{\partial \text{RSS}}{\partial \beta_0} \\ \dfrac{\partial \text{RSS}}{\partial \beta_1} \\ \vdots \\ \dfrac{\partial \text{RSS}}{\partial \beta_p} \end{bmatrix} \\ &= \begin{bmatrix} -2\sum_{i=1}^{n}x_{i0}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \\ -2\sum_{i=1}^{n}x_{i1}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \\ \vdots \\ -2\sum_{i=1}^{n}x_{ip}\left(y_i - \sum_{j=0}^{p}\beta_jx_{ij}\right) \end{bmatrix} \\ &= -2\begin{bmatrix} \sum_{i=1}^{n}x_{i0}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta)\\ \sum_{i=1}^{n}x_{i1}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta) \\ \vdots \\ \sum_{i=1}^{n}x_{ip}(y_i-\mathbf{x}_i^{T}\boldsymbol\beta) \end{bmatrix} \\ &= -2\left(\begin{bmatrix} \sum_{i=1}^{n}x_{i0}y_i\\ \sum_{i=1}^{n}x_{i1}y_i \\ \vdots \\ \sum_{i=1}^{n}x_{ip}y_i \end{bmatrix} - \begin{bmatrix} \sum_{i=1}^{n}x_{i0}\mathbf{x}_i^{T}\boldsymbol\beta\\ \sum_{i=1}^{n}x_{i1}\mathbf{x}_i^{T}\boldsymbol\beta \\ \vdots \\ \sum_{i=1}^{n}x_{ip}\mathbf{x}_i^{T}\boldsymbol\beta \end{bmatrix}\right)\\ &= -2(\mathbf{X}^{T}\mathbf{y}-\mathbf{X}^{T}\mathbf{X}\boldsymbol\beta)\text{,} \end{align}$$ where $$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^{T} \\ \mathbf{x}_2^{T} \\ \vdots \\ \mathbf{x}_n^{T} \end{bmatrix}\text{.}$$ Setting $\dfrac{\partial \text{RSS}}{\partial \boldsymbol\beta} = \mathbf{0}$, we obtain the normal equations $$\mathbf{X}^{T}\mathbf{X}\boldsymbol\beta = \mathbf{X}^{T}\mathbf{y}\text{,}$$ and, assuming $\mathbf{X}^{T}\mathbf{X}$ is invertible, $$\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
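For a quick numerical sanity check of the final formula, here is a small sketch in Python/NumPy (my own illustration, under the assumptions used above: the first column of $\mathbf{X}$ is all ones and $\mathbf{X}^{T}\mathbf{X}$ is invertible; variable names are arbitrary). It computes $\hat{\boldsymbol\beta}$ from the normal equations and compares it with NumPy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                                    # n observations, p regressors

# Design matrix X: first column of ones (intercept), then p random regressors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])      # illustrative "true" beta (length p + 1)
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # y = X beta + noise

# Closed-form OLS estimate: solve the normal equations X^T X beta = X^T y
# rather than explicitly forming the inverse (X^T X)^{-1}.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))         # True
```

Solving the normal equations with `np.linalg.solve` avoids forming $(\mathbf{X}^{T}\mathbf{X})^{-1}$ explicitly; for badly conditioned $\mathbf{X}$ one would typically prefer `np.linalg.lstsq` or a QR-based solver.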