I think it is more appropriate in this case to work exclusively in matrix notation. Let me explain.
You have a function $f : \mathrm{Mat}_{n \times p}(\mathbb R) \times \mathrm{Mat}_{p \times m}(\mathbb R) \to \mathrm{Mat}_{n \times m}(\mathbb R)$ sending a pair of matrices $(X,Y)$ to their product $f(X,Y) \overset{def}=XY$. In terms of differential geometry, if we are given a "point" in $\mathrm{Mat}_{n \times p}(\mathbb R) \times \mathrm{Mat}_{p \times m}(\mathbb R)$ (i.e. two matrices), the tangent space is canonically isomorphic to the space itself (since it is a linear manifold) and tangent vectors are just pairs of matrices. We also have a canonical basis consisting of the $(E_{i,j},0)$ and $(0, E_{k,\ell})$ where the indices $(i,j)$ range over $(1,1),\cdots,(n,p)$ and similarly, the indices $(k,\ell)$ range over $(1,1),\cdots,(p,m)$. Using the standard definition of directional derivative,
$$
\frac{\partial f}{\partial (E_{i,j},0)} = \lim_{\varepsilon \to 0} \frac{(X+\varepsilon E_{i,j})Y - XY}{\varepsilon} = \lim_{\varepsilon \to 0} \frac{\varepsilon E_{i,j}Y}{\varepsilon} = E_{i,j}Y.
$$
(Feel free to skip the differential geometry blabla if you agree with the last equation.) Similarly, you can deduce that $\frac{\partial f}{\partial (0,E_{k,\ell})} = XE_{k,\ell}$.
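These two partial-derivative formulas are easy to sanity-check numerically; here is a small sketch with NumPy (the dimensions, indices and tolerances are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 3, 4, 2                      # arbitrary dimensions for the sketch
X = rng.standard_normal((n, p))
Y = rng.standard_normal((p, m))

def f(X, Y):
    return X @ Y

def E(shape, i, j):
    """Basis matrix with a 1 in position (i, j) and zeros elsewhere."""
    M = np.zeros(shape)
    M[i, j] = 1.0
    return M

eps = 1e-6
E_ij = E((n, p), 1, 2)                 # direction (E_{i,j}, 0)
E_kl = E((p, m), 0, 1)                 # direction (0, E_{k,l})

# Directional derivatives via finite differences vs. the closed forms
num_X = (f(X + eps * E_ij, Y) - f(X, Y)) / eps
num_Y = (f(X, Y + eps * E_kl) - f(X, Y)) / eps
print(np.allclose(num_X, E_ij @ Y, atol=1e-8))   # E_{i,j} Y
print(np.allclose(num_Y, X @ E_kl, atol=1e-8))   # X E_{k,l}
```

Since $f$ is linear in each argument separately, the finite-difference quotients here agree with the closed forms up to round-off, not just to first order.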
In the same way that the Jacobian matrix of a function $g : \mathbb R^n \to \mathbb R^m$ gives you an $m \times n$ matrix, the Jacobian matrix of the function $f$ gives us an $nm \times (np+pm)$ matrix ($np+pm$ coordinates in the domain, $nm$ in the codomain), something quite discouraging. To make sense of it, we use the fact that our function $f$ is quadratic in the coefficients of $X$ and $Y$. Let us use the following formula to compute the "Taylor expansion" of this function at a pair of matrices $(X_0,Y_0)$:
$$
XY - X_0Y_0 = (X-X_0 + X_0)(Y-Y_0 + Y_0) - X_0 Y_0 \\
= \underset{\text{Jacobian (linear) term}}{\underbrace{(X-X_0)Y_0 + X_0 (Y-Y_0)}} + \underset{\text{Hessian (quadratic) term}}{\underbrace{\frac 12 \left( \phantom{\int}\hspace{-9 pt}2(X-X_0)(Y-Y_0) \right)}}
$$
This suggests that $J_f(X_0,Y_0)(X,Y) = XY_0 + X_0Y$ and $H_f(X_0,Y_0)((X,Y),(X,Y)) = 2XY$ (note that the Hessian takes two pairs of matrices as arguments, since it is a symmetric bilinear form; we kept the two arguments equal for the moment!). This is simply an application of the standard formula for vectors $x_0,x \in \mathbb R^n$, where $g : \mathbb R^n \to \mathbb R^m$:
$$
g(x) = g(x_0) + J_g(x_0)(x-x_0) + \frac 12 H_g(x_0)(x-x_0,x-x_0)
$$
The tensor here $H_g(x_0)$ is of order $3$; think of each coordinate of $\mathbb R^m$ as having its own Hessian matrix, and $H_g$ is those $m$ matrices patched together. If for some reason you are interested in the Hessian, note that for vectors $x_0,x, x' \in \mathbb R^n$ and $g : \mathbb R^n \to \mathbb R^m$,
$$
H_g(x_0)(x,x') = \sum_{i,j=1}^n x_i x_j' \frac{\partial^2 g}{\partial x_i \partial x_j}(x_0),
$$
so if we repeat the same idea for our function, where the only nonvanishing second-order partial derivatives are the mixed ones $\frac{\partial^2 f}{\partial x_{ij}\,\partial y_{k\ell}} = \frac{\partial^2 f}{\partial y_{k\ell}\,\partial x_{ij}} = E_{ij} E_{k\ell}$, we get the formula
$$
H_f(X_0,Y_0)((X,Y),(X',Y')) = \sum_{(i,j),(k,\ell)} \left( x_{ij} y'_{k\ell} + x'_{ij} y_{k\ell} \right) E_{ij} E_{k\ell} = XY' + X'Y.
$$
Setting $(X',Y') = (X,Y)$ recovers the formula $H_f(X_0,Y_0)((X,Y),(X,Y)) = 2XY$ obtained via Taylor expansion (there we had only dealt with the case $(X,Y) = (X',Y')$). In particular, the Hessian is a constant tensor of total order $3$ (i.e. it does not depend on $X_0$ or $Y_0$), which is a characteristic property of quadratic functions.
To understand what "of total order $3$" means, consider this idea: if you have a vector, taking the inner product with a vector of the same length (a tensor of order $1$) gives you a number. If you have a matrix, taking the products with two vectors, one for each dimension of the matrix, gives you back a scalar. In the case of our Hessian above, taking two vectors $(X,Y)$ and $(X',Y')$ and taking the appropriate products gives another vector (in the vector space $\mathrm{Mat}_{n \times m}(\mathbb R)$). See this for more on tensors.
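The exact Taylor expansion above, and the fact that the second derivative is constant, can both be checked numerically; here is a NumPy sketch (dimensions, directions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 3, 4, 2
X0, X = rng.standard_normal((2, n, p))
Y0, Y = rng.standard_normal((2, p, m))

# Exact expansion: XY - X0 Y0 = (X - X0) Y0 + X0 (Y - Y0) + (X - X0)(Y - Y0)
lhs = X @ Y - X0 @ Y0
jac = (X - X0) @ Y0 + X0 @ (Y - Y0)          # Jacobian (linear) term
half_hess = (X - X0) @ (Y - Y0)              # half the Hessian (quadratic) term
print(np.allclose(lhs, jac + half_hess))

# Mixed second derivative along (U, 0) and (0, V) is U V,
# independent of the base point (X0, Y0): the Hessian is constant.
U = rng.standard_normal((n, p))
V = rng.standard_normal((p, m))
eps = 1e-4
mixed = ((X0 + eps * U) @ (Y0 + eps * V) - (X0 + eps * U) @ Y0
         - X0 @ (Y0 + eps * V) + X0 @ Y0) / eps**2
print(np.allclose(mixed, U @ V, atol=1e-6))
```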
But now that this trick is dealt with, back to the original question: what is $\frac{\partial XY}{\partial X}(X_0,Y_0)$? Let us come back for a moment to the definition of the directional derivative. What we do in that context is consider a function $g(x,y)$ of two variables as a function of the single variable $x$ in order to evaluate the partial derivative at $(x_0,y_0)$. We can do the same here using the concept of the Fréchet derivative instead! This also means that $\frac{\partial f}{\partial X}$ is not a number or a matrix, but a linear operator on the space of matrices corresponding to $X$. For instance, $f(X,Y) = XY$ satisfies $\frac{\partial f}{\partial X}(X_0,Y_0)(X) = XY_0$ since
$$
\lim_{Z \to 0} \frac{\|(X_0+Z)Y_0 - X_0Y_0 - \frac{\partial f}{\partial X}(X_0,Y_0)(Z)\|}{\|Z\|} = \lim_{Z \to 0} \frac{\|(X_0+Z)Y_0 - X_0Y_0 - ZY_0\|}{\|Z\|} = 0.
$$
(The latter is true under any choice of matrix norms.) Similarly, $\frac{\partial f}{\partial Y}(X_0,Y_0)(Y) = X_0Y$.
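Since $f$ is linear in $X$ for fixed $Y$ (and vice versa), the remainder in this Fréchet quotient is not merely small but identically zero; a quick NumPy sketch of both partials (arbitrary sizes and seed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 3, 4, 2
X0 = rng.standard_normal((n, p))
Y0 = rng.standard_normal((p, m))

# Remainder (X0 + Z) Y0 - X0 Y0 - Z Y0 vanishes identically,
# so the quotient in the limit is 0 (up to floating-point round-off).
Z = 1e-3 * rng.standard_normal((n, p))
ratio_X = np.linalg.norm((X0 + Z) @ Y0 - X0 @ Y0 - Z @ Y0) / np.linalg.norm(Z)

# Same for the partial in Y, with candidate derivative W -> X0 W
W = 1e-3 * rng.standard_normal((p, m))
ratio_Y = np.linalg.norm(X0 @ (Y0 + W) - X0 @ Y0 - X0 @ W) / np.linalg.norm(W)

print(ratio_X, ratio_Y)   # both ~0 up to round-off
```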
You can still apply the chain rule with this partial derivative, but you need to be careful: when you had a composition of functions before, you multiplied the Jacobian matrices. In this case, you need to compose the linear operators, which might mean something a bit different in this context. For instance, if $X,Y$ are both functions of a real variable $t$, then
$$
\frac{\partial X(t)Y(t)}{\partial t}(t_0) = \frac{\partial XY}{\partial X}(X(t_0),Y(t_0)) \left( \frac{\partial X(t)}{\partial t}(t_0) \right) + \frac{\partial XY}{\partial Y}(X(t_0),Y(t_0)) \left( \frac{\partial Y(t)}{\partial t}(t_0) \right) \\
= X'(t_0) Y(t_0) + X(t_0) Y'(t_0).
$$
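This product rule is easy to test with a finite-difference check; the matrix-valued curves below are arbitrary smooth choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 3, 4, 2
A, B = rng.standard_normal((2, n, p))
C, D = rng.standard_normal((2, p, m))

X  = lambda t: A + t * B + t**2 * A        # X'(t) = B + 2t A
dX = lambda t: B + 2 * t * A
Y  = lambda t: C + np.sin(t) * D           # Y'(t) = cos(t) D
dY = lambda t: np.cos(t) * D

t0, eps = 0.7, 1e-6

# Central difference of t -> X(t) Y(t) vs. X'(t0) Y(t0) + X(t0) Y'(t0)
numeric = (X(t0 + eps) @ Y(t0 + eps) - X(t0 - eps) @ Y(t0 - eps)) / (2 * eps)
exact = dX(t0) @ Y(t0) + X(t0) @ dY(t0)
print(np.allclose(numeric, exact, atol=1e-6))
```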
(Note that this is what you would expect!) As an exercise, using this answer, you could prove that if $X_0,Y_0$ are matrices such that $X_0Y_0$ is a well-defined square invertible matrix, then
$$
\frac{\partial \det(XY)}{\partial X}(X_0,Y_0)(X) = \mathrm{tr}(\mathrm{adj}(X_0Y_0)(XY_0)) = \det(X_0Y_0) \, \mathrm{tr}( (XY_0)(X_0Y_0)^{-1}).
$$
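The exercise's answer can also be spot-checked numerically. A sketch with random $X_0, Y_0$ (which are generically such that $X_0Y_0$ is invertible), using $\mathrm{adj}(A) = \det(A)\,A^{-1}$ for the adjugate:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 3, 4                                # X0 Y0 is n x n
X0 = rng.standard_normal((n, p))
Y0 = rng.standard_normal((p, n))
X  = rng.standard_normal((n, p))           # direction for the partial in X

A = X0 @ Y0                                # generically invertible
detA = np.linalg.det(A)

# d/dt det((X0 + t X) Y0) at t = 0, by central differences
eps = 1e-6
numeric = (np.linalg.det((X0 + eps * X) @ Y0)
           - np.linalg.det((X0 - eps * X) @ Y0)) / (2 * eps)

# Closed form: det(X0 Y0) tr((X Y0)(X0 Y0)^{-1}), using adj(A) = det(A) A^{-1}
exact = detA * np.trace(X @ Y0 @ np.linalg.inv(A))
print(np.isclose(numeric, exact, rtol=1e-4, atol=1e-5))
```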
Hope that helps,
Best Answer
Since $$Xy = \mathrm{vec}(Xy) = \mathrm{vec}(IXy) = (y\otimes I)'\mathrm{vec}(X)$$ take the derivative wrt $\mathrm{vec}(X)$ to obtain $y\otimes I$. This is consistent with the comment of Ben Grossmann as it is the "vectorization" of said fourth order tensor.
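This vectorization identity can be verified directly with NumPy; note that $\mathrm{vec}$ stacks columns, which is `order='F'` flattening, and that a 1-D array in `np.kron` behaves like the row vector $y'$, so `np.kron(y, I)` already equals $(y\otimes I)'$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 3, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(p)

vecX = X.flatten(order='F')        # column-stacking vectorization vec(X)
I = np.eye(n)

# X y = (y kron I)' vec(X); with 1-D y, np.kron(y, I) is the n x np
# block row [y_1 I, ..., y_p I], i.e. (y kron I) already transposed.
lhs = X @ y
rhs = np.kron(y, I) @ vecX
print(np.allclose(lhs, rhs))
```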