[Math] Derivative of matrix multiplication w.r.t. a matrix – how to write

calculus, matrix-calculus, multivariable-calculus

If I have a matrix $W$:

$\begin{bmatrix}
w_{00} & w_{01} & w_{02} \\
w_{10} & w_{11} & w_{12} \\
w_{20} & w_{21} & w_{22}\\
w_{30} & w_{31} & w_{32}
\end{bmatrix}
$

and a matrix $X$:

$\begin{bmatrix}
x_{00} & x_{01} & x_{02} & x_{03}\\
x_{10} & x_{11} & x_{12} & x_{13} \\
x_{20} & x_{21} & x_{22} & x_{23}
\end{bmatrix}
$

How do I write out the derivative of $Z=XW$ w.r.t. the matrix $W$?

I know that $Z$ is (3×3):

$\begin{bmatrix}
z_{00} & z_{01} & z_{02} \\
z_{10} & z_{11} & z_{12} \\
z_{20} & z_{21} & z_{22}
\end{bmatrix}
$

Since $Z$ is (3×3), is $\frac{\partial Z}{\partial W}$ a (9×12) or a (12×9) matrix?

Best Answer

I think it is more appropriate in this case to work exclusively in matrix notation. Let me explain.

You have a function $f : \mathrm{Mat}_{n \times p}(\mathbb R) \times \mathrm{Mat}_{p \times m}(\mathbb R) \to \mathrm{Mat}_{n \times m}(\mathbb R)$ sending a pair of matrices $(X,Y)$ to their product $f(X,Y) \overset{\text{def}}{=} XY$. In terms of differential geometry, if we are given a "point" in $\mathrm{Mat}_{n \times p}(\mathbb R) \times \mathrm{Mat}_{p \times m}(\mathbb R)$ (i.e. two matrices), the tangent space is canonically isomorphic to the space itself (since it is a linear manifold) and tangent vectors are just pairs of matrices. We also have a canonical basis consisting of the $(E_{i,j},0)$ and $(0, E_{k,\ell})$, where the indices $(i,j)$ range over $(1,1),\cdots,(n,p)$ and, similarly, the indices $(k,\ell)$ range over $(1,1),\cdots,(p,m)$. Using the standard definition of the directional derivative,
$$ \frac{\partial f}{\partial (E_{i,j},0)} = \lim_{\varepsilon \to 0} \frac{(X+\varepsilon E_{i,j})Y - XY}{\varepsilon} = \lim_{\varepsilon \to 0} \frac{\varepsilon E_{i,j}Y}{\varepsilon} = E_{i,j}Y. $$
(Feel free to skip the differential geometry blabla if you agree with the latter equation.) Similarly, you can deduce that $\frac{\partial f}{\partial (0,E_{k,\ell})} = XE_{k,\ell}$.
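If you want to convince yourself of these two formulas numerically, here is a minimal NumPy sketch (my own illustration, not part of the answer; the shapes match the question's $X$ (3×4) and $W$ (4×3), with $W$ in the role of $Y$, and the entry $(i,j)$ is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 3, 4, 3                      # X is n×p, Y is p×m (the question's X and W)
X = rng.standard_normal((n, p))
Y = rng.standard_normal((p, m))

i, j = 1, 2                            # an arbitrary entry of X to perturb
E_ij = np.zeros((n, p))
E_ij[i, j] = 1.0

eps = 1e-6
numerical = ((X + eps * E_ij) @ Y - X @ Y) / eps   # directional derivative by finite differences
exact = E_ij @ Y                                   # the formula derived above

print(np.allclose(numerical, exact, atol=1e-6))    # True
```

Because $f$ is linear in $X$ for fixed $Y$, the finite difference agrees with $E_{i,j}Y$ up to roundoff.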

In the same way that the Jacobian matrix of a function $g : \mathbb R^n \to \mathbb R^m$ gives you an $m \times n$ matrix, the Jacobian matrix of the function $f$ gives us an $nm \times (np+pm)$ matrix, something quite discouraging. To enlighten us, we use the fact that our function $f$ is quadratic in the coefficients of $X$ and $Y$. Let us use the following formula to compute the "Taylor expansion" of this function at a pair of matrices $(X_0,Y_0)$:
$$ XY - X_0Y_0 = (X-X_0 + X_0)(Y-Y_0 + Y_0) - X_0 Y_0 \\ = \underset{\text{Jacobian (linear) term}}{\underbrace{(X-X_0)Y_0 + X_0 (Y-Y_0)}} + \underset{\text{Hessian (quadratic) term}}{\underbrace{\frac 12 \left( 2(X-X_0)(Y-Y_0) \right)}} $$
This suggests that $J_f(X_0,Y_0)(X,Y) = XY_0 + X_0Y$ and $H_f(X_0,Y_0)((X,Y),(X,Y)) = 2XY$ (note that the Hessian takes two pairs of matrices as arguments since it is a bilinear form; we have kept the two arguments equal for the moment!). This is simply an application of the standard formula for vectors $x_0,x \in \mathbb R^n$, where $g : \mathbb R^n \to \mathbb R^m$:
$$ g(x) = g(x_0) + J_g(x_0)(x-x_0) + \frac 12 H_g(x_0)(x-x_0,x-x_0). $$
The tensor $H_g(x_0)$ here is of order $3$; think of each coordinate of $\mathbb R^m$ as having its own Hessian matrix, and $H_g$ as those $m$ matrices patched together. If for some reason you are interested in the Hessian, note that for vectors $x_0,x,x' \in \mathbb R^n$, where $g : \mathbb R^n \to \mathbb R^m$,
$$ H_g(x_0)(x,x') = \sum_{i,j=1}^n x_i x_j' \frac{\partial^2 g}{\partial x_i \partial x_j}(x_0), $$
so if we repeat the same idea for our function, we get the formula
$$ H_f(X_0,Y_0)((X,Y),(X',Y')) = \sum_{(i,j),(k',\ell')} x_{ij}\, y'_{k'\ell'}\, E_{i,j} E_{k',\ell'} + \sum_{(k,\ell),(i',j')} y_{k\ell}\, x'_{i'j'}\, E_{i',j'} E_{k,\ell} = XY' + X'Y. $$
These are the only terms appearing in the sums because all the other second-order partial derivatives of $f$ vanish ($f$ is linear in $X$ alone and in $Y$ alone, so only the mixed derivatives survive). Setting $(X',Y') = (X,Y)$ recovers the formula $H_f(X_0,Y_0)((X,Y),(X,Y)) = 2XY$ obtained above via the Taylor expansion. In particular, the Hessian is a constant tensor of total order $3$ (i.e. it does not depend on $X_0$ or $Y_0$), which is a characteristic property of quadratic functions.
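As a sanity check that the expansion is exact, with $(X-X_0)(Y-Y_0)$ as the only remainder beyond the linear term, here is a small NumPy sketch (shapes and names are my own choice, not part of the answer):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 3, 4, 3
X0, dX = rng.standard_normal((n, p)), rng.standard_normal((n, p))
Y0, dY = rng.standard_normal((p, m)), rng.standard_normal((p, m))
X, Y = X0 + dX, Y0 + dY

lhs = X @ Y - X0 @ Y0
jacobian_term = dX @ Y0 + X0 @ dY   # linear in the increments (dX, dY)
hessian_term = dX @ dY              # quadratic in the increments; equals (1/2)·2·dX@dY

print(np.allclose(lhs, jacobian_term + hessian_term))   # True: the expansion is exact
```

There is no higher-order remainder because $f$ is quadratic, which is precisely the point of the argument above.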

To understand what "of total order $3$" means, consider this idea: if you have a vector, taking the inner product with a vector of the same length (a tensor of order $1$) gives you a number. If you have a matrix, pairing it with two vectors, one for each dimension of the matrix, gives you back a scalar. In the case of our Hessian above, taking two "vectors" $(X,Y)$ and $(X',Y')$ and forming the appropriate products gives $XY' + X'Y$, another vector (in the vector space $\mathrm{Mat}_{n \times m}(\mathbb R)$).
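If the phrase "tensor of total order $3$" still feels abstract, this tiny NumPy illustration (entirely my own example) shows the analogous object for an ordinary map $g : \mathbb R^n \to \mathbb R^m$: an array with three indices that eats two vectors and returns a vector.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2, 3
H = rng.standard_normal((m, n, n))   # order-3 tensor: one n×n Hessian matrix per output coordinate
x = rng.standard_normal(n)
xp = rng.standard_normal(n)

# Contracting the two input slots with x and x' leaves one free index,
# so the result lives in R^m, just like the values of g.
out = np.einsum('kij,i,j->k', H, x, xp)
print(out.shape)                     # (2,)
```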

But now that this trick is dealt with, back to the original question: what is $\frac{\partial XY}{\partial X}(X_0,Y_0)$? Come back for a moment to the definition of the directional derivative. What we do in that context is consider a function $g(x,y)$ of two variables as a function of the single variable $x$ in order to evaluate the partial derivative at $(x_0,y_0)$. We can do the same here, using the concept of the Fréchet derivative instead! This also means that $\frac{\partial f}{\partial X}$ is not a number or a matrix, but a linear operator on the space of matrices corresponding to $X$. For instance, $f(X,Y) = XY$ satisfies $\frac{\partial f}{\partial X}(X_0,Y_0)(X) = XY_0$, since
$$ \lim_{Z \to 0} \frac{\|(X_0+Z)Y_0 - X_0Y_0 - \frac{\partial f}{\partial X}(X_0,Y_0)(Z)\|}{\|Z\|} = \lim_{Z \to 0} \frac{\|(X_0+Z)Y_0 - X_0Y_0 - ZY_0\|}{\|Z\|} = 0. $$
(The numerator is in fact identically zero, so this holds under any choice of matrix norms.) Similarly, $\frac{\partial f}{\partial Y}(X_0,Y_0)(Y) = X_0Y$.
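To see the Fréchet quotient behave as claimed, here is a quick NumPy sketch (my own illustration; the Frobenius norm is just one admissible choice of norm):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 3, 4, 3
X0 = rng.standard_normal((n, p))
Y0 = rng.standard_normal((p, m))
Z0 = rng.standard_normal((n, p))   # a fixed direction that we shrink towards 0

for t in (1e-1, 1e-3, 1e-5):
    Z = t * Z0
    remainder = (X0 + Z) @ Y0 - X0 @ Y0 - Z @ Y0   # f(X0+Z, Y0) - f(X0, Y0) - (candidate derivative)(Z)
    print(np.linalg.norm(remainder) / np.linalg.norm(Z))   # ≈ 0 up to floating-point roundoff
```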

You can still apply the chain rule with this partial derivative, but you need to be careful: when you had a composition of functions before, you multiplied the Jacobian matrices; here you need to compose the linear operators instead, which can mean something a bit different in this context. For instance, if $X,Y$ are both functions of a real variable $t$, then
$$ \frac{\partial X(t)Y(t)}{\partial t}(t_0) = \frac{\partial XY}{\partial X}(X(t_0),Y(t_0)) \left( \frac{\partial X(t)}{\partial t}(t_0) \right) + \frac{\partial XY}{\partial Y}(X(t_0),Y(t_0)) \left( \frac{\partial Y(t)}{\partial t}(t_0) \right) \\ = X'(t_0) Y(t_0) + X(t_0) Y'(t_0). $$
(Note that this is exactly the product rule you would expect!) As an exercise, using this answer, you could prove that if $X_0,Y_0$ are matrices such that $X_0Y_0$ is a well-defined square invertible matrix, then
$$ \frac{\partial \det(XY)}{\partial X}(X_0,Y_0)(X) = \mathrm{tr}(\mathrm{adj}(X_0Y_0)\,(XY_0)) = \det(X_0Y_0) \, \mathrm{tr}\!\left( (XY_0)(X_0Y_0)^{-1}\right). $$
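For a finite-difference check of that last formula, here is a NumPy sketch (my own; the shapes are chosen so that $X_0Y_0$ is square, and the direction $X$ is random):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 4
X0 = rng.standard_normal((n, p))
Y0 = rng.standard_normal((p, n))   # X0 @ Y0 is n×n and (generically) invertible
X = rng.standard_normal((n, p))    # the direction in which we differentiate

A = X0 @ Y0
exact = np.linalg.det(A) * np.trace(X @ Y0 @ np.linalg.inv(A))

eps = 1e-6                         # central difference: error is O(eps^2)
numerical = (np.linalg.det((X0 + eps * X) @ Y0) - np.linalg.det((X0 - eps * X) @ Y0)) / (2 * eps)

print(exact, numerical)            # the two values should agree to many significant digits
```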

Hope that helps,