In the book The Elements of
Statistical Learning (published by Springer) we can find the following statement:
We can write
$RSS(\beta) = (\mathbf{y}-\mathbf{X}\beta)^T(\mathbf{y}-\mathbf{X}\beta)$
where $\mathbf{X}$ is an $N\times p$ matrix with each row an input vector, and $\mathbf{y}$ is an $N$-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get the normal equations
$\mathbf{X}^T(\mathbf{y}-\mathbf{X}\beta) = 0$
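As a quick numerical sanity check (not from the book; random data, names are illustrative), the least-squares solution returned by `numpy.linalg.lstsq` should satisfy these normal equations up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))   # N x p matrix, each row an input vector
y = rng.normal(size=N)        # N-vector of outputs

# Least-squares fit: beta minimizes RSS(beta) = ||y - X beta||^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Normal equations: X^T (y - X beta) = 0, up to floating-point error
residual = X.T @ (y - X @ beta)
print(np.allclose(residual, 0))  # True
```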
Questions
How do I formally derive the normal equations operating on matrix level calculations without diving into operating on scalar elements?
Is my second attempt valid?
First attempt
Note: $\beta$ is a $p$-vector. Let us assume that vectors are column matrices.
As in The Matrix Cookbook (http://www.math.uwaterloo.ca/~hwolkowi//matrixcookbook.pdf) let us assume that
$\partial\mathbf{X}^T = (\partial\mathbf{X})^T$
and
$\partial(\mathbf{XY})=\partial(\mathbf{X})\mathbf{Y}+\mathbf{X}\partial(\mathbf{Y})$.
Let us differentiate with respect to $\beta$ and observe that $\partial (\mathbf{y}-\mathbf{X}\beta)=-\mathbf{X}$.
Now $\partial RSS(\beta)=(\partial (\mathbf{y}-\mathbf{X}\beta))^T(\mathbf{y}-\mathbf{X}\beta)+(\mathbf{y}-\mathbf{X}\beta)^T \partial (\mathbf{y}-\mathbf{X}\beta)$
Which gives us $\partial RSS(\beta)=-\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)-(\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta))^T$
At this point we find a contradiction, because the dimensions are incompatible for summation: $\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)$ is a column $p$-vector, while $(\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta))^T$ is a row $p$-vector.
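The mismatch is easy to see with a quick shape check in NumPy (illustrative random data; the names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5, 3
X = rng.normal(size=(N, p))
y = rng.normal(size=(N, 1))       # column N-vector
beta = rng.normal(size=(p, 1))    # column p-vector

a = X.T @ (y - X @ beta)          # column p-vector
b = (X.T @ (y - X @ beta)).T      # row p-vector
print(a.shape, b.shape)           # (3, 1) (1, 3)
```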
Second attempt
If I assumed $\partial(\mathbf{X}^T\mathbf{Y})=\partial(\mathbf{X})^T\mathbf{Y}+(\mathbf{X}^T\partial(\mathbf{Y}))^T$
I would get that
$\partial RSS(\beta)=-2\mathbf{X}^T (\mathbf{y}-\mathbf{X}\beta)$, which matches with the normal equations from the book.
Best Answer
If you have not found an example, here it is.
Before we start deriving the gradient, some facts and notation for brevity: let $A:B := \operatorname{tr}(A^T B)$ denote the Frobenius inner product; it satisfies $A:BC = B^T A : C$.
Let $f := \left\|y- X\beta \right\|^2 = \left(y- X\beta \right)^T \left(y- X\beta \right) = \left(y- X\beta\right):\left(y- X\beta\right)$.
Now, we can obtain the differential first, and then the gradient. \begin{align} df &= d\left( \left(y- X\beta\right):\left(y- X\beta\right) \right) \\ &= 2\left(y- X\beta \right) : \left(-X\, d\beta\right) \\ &= -2X^T\left(y- X\beta\right) : d\beta\\ \end{align}
Thus, the gradient is \begin{align} \frac{\partial}{\partial \beta} \left( \left\|y - X \beta \right\|^2 \right)= -2X^T\left(y- X\beta\right). \end{align}
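The gradient formula can be checked against central finite differences; since $f$ is quadratic in $\beta$, central differences are exact up to roundoff (a sketch with random data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 40, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

def rss(b):
    """RSS(b) = ||y - X b||^2."""
    r = y - X @ b
    return r @ r

# Analytic gradient: -2 X^T (y - X beta)
grad = -2 * X.T @ (y - X @ beta)

# Central finite differences along each coordinate direction
eps = 1e-6
fd = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.allclose(grad, fd, atol=1e-4))  # True
```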