Ordinary Least Squares Estimate

linear-algebra, linear-regression, regression, statistics

In this paper (https://www.stat.berkeley.edu/~brill/Papers/lehmannfest.pdf), the author claims (and, more generally, I have seen it usually written this way elsewhere) that the Ordinary Least Squares estimate for the classical problem $y_i = \beta_0 + \beta_1^T x_i$ plus some noise (where $\beta_1, x_i$ are $p$-dimensional vectors) satisfies:

$$\hat{\beta}_1 = \left( \sum_{i=1}^n (x_i - \bar{x}) (x_i - \bar{x})^T \right)^{-1} \left( \sum_{i=1}^n y_i (x_i - \bar{x}) \right) $$

Nevertheless, what I know nowadays as the classical ordinary least squares estimate is the following quantity (see e.g. https://en.wikipedia.org/wiki/Linear_least_squares): $$\hat{\beta} = \left( \sum_{i=1}^n x_i x_i^T \right)^{-1} \left( \sum_{i=1}^n y_i x_i \right) $$

However, my preliminary calculations do not indicate that the two quantities above are equal, as they would need to be for the paper's claim to hold. Does anyone see why they are equal, if indeed they are? Perhaps it has something to do with the assumptions of the least squares estimate? What am I missing?
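
For concreteness, here is a minimal sketch of the comparison I tried, using NumPy with synthetic data (the variable names and the data-generating setup are my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                 # rows are the x_i
y = 2.0 + X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

xbar = X.mean(axis=0)
Xc = X - xbar                               # centered rows x_i - x̄

# Centered formula from the paper: (Σ (x_i - x̄)(x_i - x̄)^T)^{-1} Σ y_i (x_i - x̄)
beta1_centered = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

# "Modern" formula with raw x_i: (Σ x_i x_i^T)^{-1} Σ y_i x_i
beta_raw = np.linalg.solve(X.T @ X, X.T @ y)

print(beta1_centered)
print(beta_raw)                             # differs in general
```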

Best Answer

In what follows, we work with matrix-vector products, i.e., matrices act on column vectors from the left (which differs from the convention used in the reference).

Let $S = \{ (x_{i}, y_{i})\}_{i=1}^{n}$ denote the collection of data, where $x_{i} \in \mathbb{R}^{p}$ and $y_{i} \in \mathbb{R}$ for $i=1,\dots,n.$ In the first case, we consider a mean-centered linear model: $$y_{i} = a_{0} + \sum_{j=1}^{p} a_{j} (x_{ij} - \bar{x}_{j}) + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n,$$ where $\bar{x}_{j} = \frac{\sum_{i=1}^{n} x_{ij}}{n}$ is the center of mass of the $j$th feature.

Now, we express it in matrix notation (for convenience). Namely, let $$X-\bar{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \\ \end{pmatrix} - \begin{pmatrix} 0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\ 0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \bar{x}_{1} & \cdots & \bar{x}_{p} \\ \end{pmatrix}, $$ $\epsilon = \begin{pmatrix} \epsilon_{1} & \epsilon_{2} & \cdots & \epsilon_{n} \end{pmatrix}^{t}, $ and $Y = \begin{pmatrix} y_{1} & y_{2} & \cdots & y_{n} \end{pmatrix}^{t}. $ Then, we have $Y = (X - \bar{X})a + \epsilon,$ where $a = \begin{pmatrix} a_{0} & a_{1} & \cdots & a_{p} \end{pmatrix}^{t}, $ and the right-hand side expands to $Xa - \bar{X}a + \epsilon.$
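
As an illustration (mine, not the reference's), here is a small NumPy sketch of the matrices just defined, assuming the same layout with a leading column of ones:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
features = rng.normal(size=(n, p))                 # the x_{ij}

X = np.column_stack([np.ones(n), features])        # n x (p+1) design with ones column
Xbar = np.column_stack([np.zeros(n),
                        np.tile(features.mean(axis=0), (n, 1))])

a = rng.normal(size=p + 1)                         # a = (a_0, a_1, ..., a_p)^t
eps = 0.1 * rng.normal(size=n)
Y = (X - Xbar) @ a + eps

# The right-hand side indeed expands as X a - X̄ a + ε
assert np.allclose(Y, X @ a - Xbar @ a + eps)
```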

In the second case, we consider a linear model: $$y_{i} = b_{0} + \sum_{j=1}^{p} b_{j} x_{ij} + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n.$$ Subsequently, we express it in matrix notation as $Y = Xb + \epsilon,$ where $X$ and $\epsilon$ are as above and $b$ is to be determined. To find the elements of $b,$ we rearrange the right-hand side of the mean-centered linear model to obtain: $$y_{i} = a_{0} - \sum_{j=1}^{p} a_{j} \bar{x}_{j} + \sum_{j=1}^{p}a_{j}x_{ij} + \epsilon_{i} \qquad \text{for} \quad i=1,\dots,n.$$ This suggests $b_{0} = a_{0} - \sum_{j=1}^{p} a_{j} \bar{x}_{j}$ and $b_{j} = a_{j}$ for $j=1,\dots,p,$ so we have found all of the elements of $b.$ Hence, we observe that the two linear models are equivalent, being the same model in different parameterizations. (From here, one can derive the corresponding least squares estimates.)
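
As a sanity check of my own (not part of the original answer), one can fit both parameterizations by least squares and confirm numerically that $b_{j} = a_{j}$ for $j=1,\dots,p$ and $b_{0} = a_{0} - \sum_{j} a_{j} \bar{x}_{j},$ and that the slope part of $a$ matches the centered formula from the question:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
features = rng.normal(size=(n, p))
y = 2.0 + features @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

ones = np.ones((n, 1))
xbar = features.mean(axis=0)

X = np.hstack([ones, features])              # uncentered design:  Y = X b + ε
Xc = np.hstack([ones, features - xbar])      # centered design:    Y = (X - X̄) a + ε

b, *_ = np.linalg.lstsq(X, y, rcond=None)
a, *_ = np.linalg.lstsq(Xc, y, rcond=None)

assert np.allclose(b[1:], a[1:])               # b_j = a_j for j = 1, ..., p
assert np.allclose(b[0], a[0] - a[1:] @ xbar)  # b_0 = a_0 - Σ_j a_j x̄_j

# The slope part of a coincides with the centered formula from the question
slope = np.linalg.solve((features - xbar).T @ (features - xbar),
                        (features - xbar).T @ y)
assert np.allclose(slope, a[1:])
```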

A note of caution: in (1.3) of the reference, the inverse is not applied, whereas your expression contains it. In your case, you have assumed that $\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ is invertible, which is equivalent to the $p$ centered feature columns being linearly independent.
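
To illustrate the invertibility caveat (again a sketch of mine, not from the reference): if one feature is, after centering, a linear combination of the others, then $\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ is singular and the inverse in your expression does not exist.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x1 = rng.normal(size=n)
x2 = 3.0 * x1 + 1.0                     # after centering, x2 is a multiple of x1
features = np.column_stack([x1, x2])

Xc = features - features.mean(axis=0)
S = Xc.T @ Xc                           # Σ (x_i - x̄)(x_i - x̄)^T

print(np.linalg.matrix_rank(S))         # 1 < p = 2, so S is not invertible
```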