[Math] Derivative of a particular matrix-valued function with respect to a vector

derivatives

I am reading a section of a book regarding linear regression and came across a derivation that I could not follow.

It starts with a loss function:

$\mathcal{L}(\textbf{w},S) = (\textbf{y}-\textbf{X}\textbf{w})^\top(\textbf{y}-\textbf{X}\textbf{w})$

and then states that "We can seek the optimal $\textbf{w}$ by taking the derivatives of the loss with respect to $\textbf{w}$ and setting them to the zero vector"

$\frac{\partial\mathcal{L}(\textbf{w},S)}{\partial\textbf{w}} = -2\textbf{X}^{\top}\textbf{y} + 2\textbf{X}^\top\textbf{X}\textbf{w} = \textbf{0}$

How is this derivative being calculated? I realize I have no idea how to take the derivative of vector- or matrix-valued functions, especially when the derivative is taken with respect to a vector. I found a PDF (http://orion.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf, the Matrix Cookbook) that appears to address some of my questions, yet my attempts at taking the derivative of the loss function seem to be missing a transpose, and thus do not reduce as nicely as the book's result.

Best Answer

The definition of the derivative can be found at http://en.wikipedia.org/wiki/Fr%C3%A9chet_derivative.

In this case, the derivative can be computed directly by expanding the function: $$\mathcal{L}(w+\delta,S)= \langle y -X(w+\delta), y -X(w + \delta) \rangle = \mathcal{L}(w,S)+2 \langle y -Xw,-X\delta\rangle+ || X \delta||^2.$$ The second term can be written as $2 \langle -X^T(y -Xw),\delta\rangle = \langle -2X^Ty +2X^TXw,\delta\rangle$, from which it follows that the (Fréchet) derivative is $ \frac{\partial \mathcal{L}(w,S)}{\partial w} = (-2X^Ty +2X^TXw)^T$.
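As a quick sanity check (a minimal NumPy sketch with illustrative random data, not taken from the book), the formula $-2X^Ty + 2X^TXw$ can be compared against a finite-difference approximation of the loss:

```python
import numpy as np

# Illustrative random problem (not from the book): 20 samples, 3 features.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def loss(w):
    """L(w, S) = (y - Xw)^T (y - Xw)."""
    r = y - X @ w
    return r @ r

# Analytic gradient from the derivation above.
grad = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite differences, one coordinate at a time.
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
               for e in np.eye(d)])

print(np.allclose(grad, fd, atol=1e-4))  # should print True
```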

The derivative can also be computed componentwise, but requires more bookkeeping.
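Concretely, a sketch of that bookkeeping: writing the loss as a sum over components,
$$\mathcal{L}(w,S) = \sum_{i}\Big(y_i - \sum_{j} X_{ij} w_j\Big)^2, \qquad \frac{\partial \mathcal{L}(w,S)}{\partial w_k} = -2\sum_{i} X_{ik}\Big(y_i - \sum_{j} X_{ij} w_j\Big),$$
which is exactly the $k$-th component of $-2X^\top(y - Xw) = -2X^\top y + 2X^\top X w$.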

The expression you have for the partial is missing a transpose.
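For completeness, setting the derivative to the zero vector gives the normal equations, and if $X^\top X$ is invertible the optimal $w$ follows in closed form:
$$X^\top X w = X^\top y \quad\Longrightarrow\quad w = (X^\top X)^{-1} X^\top y.$$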
