[Math] How to differentiate a matrix equation with respect to a vector

derivatives, linear algebra

I'm having lots of trouble piecing this together (from *The Elements of Statistical Learning* by Hastie, Tibshirani, and Friedman):

I don't understand the step "[d]ifferentiating w.r.t. $B$": specifically, how do you compute the derivative of an expression involving matrix products and transposes with respect to a vector? Is this standard matrix calculus?

Best Answer

$\newcommand{\mat}[1]{\mathbf{#1}}$ Yes — you can solve this using the standard tools of matrix calculus.

In particular, we can use the following rules (which you can confirm componentwise):

  1. If matrix $\mat{A}$ does not depend on the entries in vector $\mat x$, then $\frac{\partial}{\partial \mat x}(\mat A\mat x)=\mat A$.
  2. If the matrices $\mat A$ and $\mat B$ may depend on the entries in vector $\mat x$, then $\frac{\partial }{\partial \mat x}(\mat A\mat B) = \frac{\partial \mat A}{\partial \mat x}\mat B + \mat A\frac{\partial \mat B}{\partial \mat x}$. This holds even in the special case that $\mat{A}$ and $\mat {B}$ are vectors.
  3. The derivative of a sum is the sum of the derivatives.
  4. The derivative of the transpose is the transpose of the derivative.
  5. The transpose of a sum is the sum of the transposes.

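As a quick sanity check on rule 1, here is a small numerical sketch (with an arbitrary random matrix, purely for illustration) comparing a finite-difference Jacobian of $f(\mat x) = \mat A \mat x$ against $\mat A$ itself:

```python
# Numerical check of rule 1: the Jacobian of f(x) = A x is A,
# provided A does not depend on x. A and x here are arbitrary toy data.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

def f(x):
    return A @ x

# Central finite differences: J[:, j] ≈ (f(x + h e_j) - f(x - h e_j)) / (2h)
h = 1e-6
J = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = h
    J[:, j] = (f(x + e) - f(x - e)) / (2 * h)

print(np.allclose(J, A, atol=1e-6))  # True: the Jacobian equals A
```

Since $f$ is linear, the central differences recover $\mat A$ exactly up to floating-point rounding.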
Hence, given that the matrix $\mathbf{X}$ and the vector $\mathbf{y}$ do not depend on the vector $\beta$, we find the following results:

  1. $\frac{\partial }{\partial \beta}(\mat y - \mat X \beta ) = -\mat X$
  2. $\frac{\partial }{\partial \beta}(\mat y - \mat X \beta )^\top = -\mat X^\top$
  3. $$\begin{align*}\frac{\partial }{\partial \beta}(\mat y - \mat X \beta )^\top(\mat y - \mat X \beta) &= -\mat X^\top (\mat y - \mat X \beta) - (\mat y - \mat X \beta)^\top \mat X\\&=-\mat X^\top \mat y + \mat X^\top\mat X\beta - \mat y^\top \mat X + \beta^\top\mat X^\top \mat X \\ &= (-\mat X^\top \mat y + \mat X^\top\mat X\beta) + (-\mat X^\top \mat y + \mat X^\top\mat X\beta)^\top \\ &= 2(-\mat X^\top \mat y + \mat X^\top \mat X\beta) &\{\alpha^\top = \alpha \text{, when }\alpha\text{ is a scalar}\}\\ &= -2\mat{X}^\top(\mat y - \mat X \beta)&\{\text{factor out }-\mat{X}^\top\} \\\end{align*}$$

And this last term is equal to zero if and only if

$$\mat X^\top (\mat y - \mat X \beta) = 0.$$

This system is the set of “normal equations”; it identifies the value of $\beta$ at which the quadratic $(\mat y - \mat X \beta)^\top(\mat y - \mat X \beta)$ has a critical point.
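The whole derivation can be verified numerically. The sketch below (synthetic data, names chosen for illustration) checks that the gradient of $S(\beta) = (\mat y - \mat X \beta)^\top(\mat y - \mat X \beta)$ matches $-2\mat X^\top(\mat y - \mat X \beta)$ via finite differences, and that solving the normal equations $\mat X^\top \mat X \beta = \mat X^\top \mat y$ agrees with NumPy's least-squares solver:

```python
# Numerical check of (1) the gradient formula derived above and
# (2) the normal equations, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
beta = rng.standard_normal(4)

def S(b):
    """Residual sum of squares (y - Xb)ᵀ(y - Xb)."""
    r = y - X @ b
    return r @ r

# (1) closed-form gradient vs. central finite differences at an arbitrary beta
analytic = -2 * X.T @ (y - X @ beta)
h = 1e-6
numeric = np.array([(S(beta + h * e) - S(beta - h * e)) / (2 * h)
                    for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-4))   # True

# (2) the normal equations XᵀXβ = Xᵀy give the least-squares minimizer
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_normal, beta_lstsq))        # True
```

Because $S$ is quadratic in $\beta$, central differences agree with the analytic gradient up to rounding error; in practice one would use `np.linalg.lstsq` (or a QR/SVD factorization) rather than forming $\mat X^\top \mat X$ explicitly, since the latter squares the condition number.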