Matrix differentiation by a vector in Least Squares method

calculus, matrix equations, matrix-calculus

In the book The Elements of Statistical Learning, published by Springer, we can find the following statement:

We can write

$RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)$

where $\mathbf{X}$ is an $N\times p$ matrix with each row an input vector, and $\mathbf{y}$ is an $N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get the normal equations

$\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta) = 0$

Questions

How do I formally derive the normal equations by working at the matrix level, without dropping down to scalar elements?

Is my second attempt valid?

First attempt

Note: $\beta$ is a $p$-vector. Let us treat all vectors as column matrices.

As in The Matrix Cookbook (http://www.math.uwaterloo.ca/~hwolkowi//matrixcookbook.pdf) let us assume that
$\partial\mathbf{X}^T = (\partial\mathbf{X})^T$
and
$\partial(\mathbf{XY})=\partial(\mathbf{X})\mathbf{Y}+\mathbf{X}\partial(\mathbf{Y})$.

Let us differentiate with respect to $\beta$ and observe that $\partial (\mathbf{y}-\mathbf{X}\beta)=-\mathbf{X}$.

Now $\partial RSS(\beta)=(\partial (\mathbf{y}-\mathbf{X}\beta))^T(\mathbf{y}-\mathbf{X}\beta)+(\mathbf{y}-\mathbf{X}\beta)^T \partial (\mathbf{y}-\mathbf{X}\beta)$

This gives us $\partial RSS(\beta)=-\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)-(\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta))^T$.

At this point we reach a contradiction, because the dimensions are incompatible for summation: $\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)$ is a column $p$-vector, while $(\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta))^T$ is a row $p$-vector.

Second attempt

If I instead assumed $\partial(\mathbf{X}^T\mathbf{Y})=\partial(\mathbf{X})^T\mathbf{Y}+(\mathbf{X}^T\partial(\mathbf{Y}))^T$,

I would get

$\partial RSS(\beta)=-2\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)$, which matches the normal equations from the book.
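For what it is worth, the candidate gradient $-2\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta)$ can be sanity-checked numerically against finite differences of $RSS(\beta)$. A minimal sketch in NumPy; the sizes and random data are arbitrary, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4                      # arbitrary sizes for the check
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def rss(b):
    r = y - X @ b
    return r @ r                  # (y - Xb)^T (y - Xb)

# Candidate gradient from the second attempt: -2 X^T (y - X beta)
grad_analytic = -2 * X.T @ (y - X @ beta)

# Central finite differences of RSS w.r.t. each component of beta
eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```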

Best Answer

If you have not found a worked example yet, here is one.


Before we start deriving the gradient, here are some facts and notation, for brevity:

  • Trace and Frobenius product relation $$\left\langle A, B C\right\rangle={\rm tr}(A^TBC) := A : B C$$
  • Cyclic properties of the trace/Frobenius product \begin{align} A : B C &= BC : A \\ &= B^T A : C \\ &= \text{etc.} \end{align} (a quick numerical check of these identities is sketched just after this list)
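As a side note (my addition, not part of the original answer), these identities are easy to verify numerically. A minimal sketch with random matrices of arbitrary compatible sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 5))

def frob(M, N):
    # Frobenius inner product  M : N  =  tr(M^T N)
    return np.trace(M.T @ N)

lhs = frob(A, B @ C)                       # A : BC
print(np.isclose(lhs, frob(B @ C, A)))     # BC : A    -> True
print(np.isclose(lhs, frob(B.T @ A, C)))   # B^T A : C -> True
```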

Let $f := \left\|y- X\beta \right\|^2 = \left(y- X\beta \right)^T \left(y- X\beta \right) = \left(y- X\beta\right) : \left(y- X\beta\right)$.

Now, we can obtain the differential first, and then the gradient. \begin{align} df &= d\left( \left(y- X\beta\right) : \left(y- X\beta\right) \right) \\ &= 2\left(y- X\beta \right) : \left(-X \, d\beta\right) \\ &= -2X^T\left(y- X\beta\right) : d\beta \end{align}

Thus, the gradient is \begin{align} \frac{\partial}{\partial \beta} \left( \left\|y - X \beta \right\|^2 \right)= -2X^T\left(y- X\beta\right). \end{align}
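As a short follow-up (my addition), setting this gradient to zero reproduces the normal equations $X^T\left(y - X\hat\beta\right) = 0$, i.e. $X^TX\hat\beta = X^Ty$. A minimal NumPy sketch that solves them and confirms the gradient vanishes at the solution (the data is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=N)

# Solve the normal equations  X^T X beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient -2 X^T (y - X beta_hat) should be (numerically) zero
gradient = -2 * X.T @ (y - X @ beta_hat)
print(np.allclose(gradient, 0, atol=1e-8))    # True

# Cross-check against the library least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))      # True
```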