Linear Algebra – Understanding Linear Algebra in OLS Derivation

Tags: least-squares, linear-algebra, regression

In ordinary least squares there is this equation (Kevin Murphy's book, page 221, latest edition):

$$NLL(w)=\frac{1}{2}({y-Xw})^T(y-Xw)=\frac{1}{2}w^T(X^TX)w-w^T(X^T)y$$

I am not sure how the RHS equals the LHS. Maybe my linear algebra is weak, but I can't figure out how this happens. Can somebody point out how this works? This is related to deriving the ordinary least squares estimator $\hat{w}_{OLS}=(X^TX)^{-1}X^Ty$.

NLL stands for negative log-likelihood.

I am attaching screenshots of the relevant section. My equation is in the first image (page 221). I actually bought the book, so I am hoping that showing two pages is not a copyright infringement. Source: Kevin Murphy, Machine Learning: A Probabilistic Perspective.

[Screenshots of the relevant book pages]

Best Answer

The sum of squares expands to:

$$ \begin{align*} \frac{1}{2}({y-Xw})^T(y-Xw) &= \frac{1}{2}\left( y^\top y - y^\top Xw - \left( X w \right) ^\top y + \left( X w \right)^\top X w \right) \\ &= \frac{1}{2}\left( y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \right) \\ &= \frac{1}{2}\left(y^{\top}y - 2 y^\top X w + w^\top X^\top X w \right) \end{align*} $$

Note that $y^\top X w$ is a $1 \times 1$ matrix (i.e., a scalar), hence it equals its own transpose, so $y^\top X w = (y^\top X w)^\top = w^\top X^\top y$.
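If it helps to see the expansion numerically, here is a minimal NumPy sketch (my own illustration, not from the book or the answer) that checks it on random data:

```python
import numpy as np

# Random data; the shapes of X, y, w are arbitrary choices for illustration.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

lhs = 0.5 * (y - X @ w) @ (y - X @ w)                    # 1/2 (y - Xw)^T (y - Xw)
rhs = 0.5 * y @ y - y @ X @ w + 0.5 * w @ (X.T @ X) @ w  # expanded form
print(np.isclose(lhs, rhs))  # True
```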

Observe that the minimization problem
$$ \text{minimize (over $w$)} \quad \frac{1}{2}y^{\top}y - y^\top X w + \frac{1}{2} w^\top X^\top X w $$
has the same solution for $w$ as
$$ \text{minimize (over $w$)} \quad \frac{1}{2}w^\top X^\top X w - y^\top X w $$
because the scalar $y^\top y$ does not depend on $w$ and so does not affect the optimal choice of $w$. Also observe that the objective is a convex function of $w$, hence the first-order condition is both necessary and sufficient for an optimum. The first-order condition is:
$$ \frac{\partial \mathcal{L}}{\partial w} = \frac{1}{2}\left(X^\top X + (X^\top X)^\top \right)w - X^\top y = 0 $$
Since $X^\top X$ is symmetric, this reduces to the normal equations:
$$ X^\top X w = X^\top y$$
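As a quick numerical check of the first-order condition (again a small NumPy sketch of my own, with random data), solving the normal equations and evaluating the gradient of the reduced objective at that point gives zero up to rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_star = np.linalg.solve(X.T @ X, X.T @ y)  # solve the normal equations
grad = X.T @ X @ w_star - X.T @ y           # gradient of 1/2 w^T X^T X w - y^T X w
print(np.allclose(grad, 0.0))               # True (up to rounding error)
```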

If $X^\top X $ is full rank (i.e. the columns of $X$ are linearly independent), $X^\top X$ can be inverted to obtain the unique solution:

$$ w = \left( X^\top X \right)^{-1} X^\top y$$
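In code (a sketch under the same full-column-rank assumption, not the book's implementation), the closed-form estimate matches NumPy's least-squares solver; in practice one would use a solver such as `lstsq` or `solve` rather than forming the explicit inverse, for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_closed = np.linalg.inv(X.T @ X) @ X.T @ y       # closed-form OLS estimate
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least-squares solver
print(np.allclose(w_closed, w_lstsq))             # True when X has full column rank
```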

Note: for reference, matrix calculus rules for taking the partial derivative with respect to $w$ can be found in the Wikipedia article on matrix calculus. Or, as rightskewed pointed out, there's the classic Matrix Cookbook.
