Derivative of a matrix divided by its Frobenius norm

derivatives, matrices, matrix-calculus

I have a function $f$ that takes a matrix $\textbf{X}\in \mathbb{R}^{m \times n}$ as input and returns a matrix $\textbf{Y} \in \mathbb{R}^{m \times n}$ as output:

$$
\textbf{Y}=f(\textbf{X}) = \textbf{X} \frac{1}{||\textbf{X}||_F}
$$

where $||\textbf{X}||_F \in \mathbb{R}$ is the Frobenius norm of matrix $\textbf{X}$. What is the derivative of $\textbf{Y}$ with respect to $\textbf{X}$, i.e., what is $\frac{\partial \textbf{Y}}{\partial \textbf{X}}$?


This is my attempt:

First, I set $a=||\textbf{X}||_F=\sqrt{\sum^m_{i=1}\sum^n_{j=1}x_{ij}^2}$ and tried to calculate $\frac{\partial a}{\partial \textbf{X}}$ myself as follows:
$$
\frac{\partial a}{\partial x_{ij}} = \frac{1}{2\sqrt{\sum^m_{i=1}\sum^n_{j=1}x_{ij}^2}}\cdot 2x_{ij}=\frac{1}{a} x_{ij}
\Rightarrow \frac{\partial a}{\partial \textbf{X}} = \frac{1}{a}\textbf{X}
$$
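As a quick sanity check of this step, here is a minimal NumPy sketch that compares $\frac{1}{a}\textbf{X}$ with a central finite-difference estimate. The helper names `frobenius_norm` and `numerical_grad`, the step size, and the random $3 \times 4$ test matrix are illustrative choices, not part of the problem.

```python
import numpy as np

def frobenius_norm(X):
    # a = sqrt(sum_ij x_ij^2)
    return np.sqrt(np.sum(X ** 2))

def numerical_grad(func, X, eps=1e-6):
    # Central finite differences, perturbing one entry of X at a time.
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        G[idx] = (func(X + step) - func(X - step)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))  # arbitrary non-square test matrix
a = frobenius_norm(X)

analytic = X / a                              # claimed gradient (1/a) X
numeric = numerical_grad(frobenius_norm, X)   # finite-difference estimate
max_err = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_err)
```

If the two agree to within the finite-difference error, that suggests the formula for $\frac{\partial a}{\partial \textbf{X}}$ itself is fine and the problem lies in the later steps.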

Then calculate $\frac{\partial \textbf{Y}}{\partial a}$ as follows
$$
\textbf{Y} = \textbf{X} \frac{1}{||\textbf{X}||_F} = \frac{1}{a} \textbf{X}
\Rightarrow
\frac{\partial \textbf{Y}}{\partial a} = -\frac{1}{a^2}\textbf{X}
$$

I think
$\frac{\partial \textbf{Y}}{\partial \textbf{X}} \in \mathbb{R}^{m \times n}$. Using chain rule:
$$
\frac{\partial \textbf{Y}}{\partial \textbf{X}}= \frac{1}{a} + \frac{\partial \textbf{Y}}{\partial a} \frac{\partial a}{\partial \textbf{X}} = \frac{1}{a} + \left(-\frac{1}{a^2}\textbf{X}\right)\left(\frac{1}{a}\textbf{X}\right) = \frac{1}{a} - \frac{1}{a^3}\left(\textbf{X} \textbf{X} \right) \quad \leftarrow \text{I think this is wrong}
$$

I think I am doing this incorrectly: $\textbf{X}\in \mathbb{R}^{m \times n}$ is a non-square matrix, so the product $\textbf{X}\textbf{X}$ is not even defined, and I expected $\frac{\partial \textbf{Y}}{\partial \textbf{X}} \in \mathbb{R}^{m \times n}$.


How should I calculate $\frac{\partial a}{\partial \textbf{X}}$ and $\frac{\partial \textbf{Y}}{\partial \textbf{X}}$?

Best Answer

$ \def\l{\lambda} \def\d{\delta} \def\L#1{\l^{-#1}} \def\o{{\tt1}} \def\p{\partial} \def\E{{\cal E}} \def\F{{\cal F}} \def\G{{\cal G}} \def\H{{\cal H}} \def\e{\epsilon} \def\f{\phi} \def\g{\gamma} \def\h{\eta} \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\vc#1{\op{vec}\LR{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\qiq{\quad\implies\quad} \def\qif{\quad\iff\quad} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} $A matrix-by-matrix gradient is a fourth-order tensor, so the use of tensors is unavoidable.
Towards that end, introduce the Frobenius $(:)$ and dyadic $(\star)$ products
$$\eqalign{ \f &= G:H \qiq \f = \sum_{i=1}^m\sum_{j=1}^n G_{ij}\,H_{ij} \\ F &= G:\H \qiq F_{kl} = \sum_{i=1}^m\sum_{j=1}^n G_{ij}\,\H_{ijkl} \\ F &= \G:H \qiq F_{ij} = \sum_{k=1}^m\sum_{l=1}^n \G_{ijkl}\,H_{kl} \\ \F &= G\star H \qiq \F_{ijkl} = G_{ij}H_{kl} \\ }$$
and an identity tensor $\E$ which can be defined in terms of Kronecker deltas
$$\eqalign{ \E_{ijkl} &= \d_{ik}\d_{jl} = \begin{cases} \o\quad{\rm if}\;\;i=k\;\;{\rm and}\;\;j=l \\ 0\quad{\rm otherwise} \\ \end{cases} \\ \E:B &= B:\E = B \quad \big({\rm identity\:relation}\big) \\ }$$

Next, we need to differentiate the Frobenius norm
$$\eqalign{ \l &= \|X\|_F \\ \l^2 &= \|X\|^2_F \;=\; X:X \\ 2\l\:d\l &= 2X:dX \\ {d\l} &= \L1X:dX \\ }$$

Then we need to differentiate $Y$
$$\eqalign{ Y &= \L1X \\ dY &= \L1dX - \L2X\,\c{d\l} \\ &= \L1\E:dX - \L2X\LR{\c{\L1X:dX}} \\ &= \L1\LR{\E-Y\star Y}:dX \\ \grad{Y}{X} &= \L1\LR{\E-Y\star Y} \\ }$$

As expected the gradient is a fourth-order tensor.
Translating this into index notation $$\eqalign{ \grad{Y_{ij}}{X_{kl}} \;=\; \frac{\E_{ijkl}-Y_{ij}Y_{kl}}\l \;=\; \frac{\d_{ik}\d_{jl}-Y_{ij}Y_{kl}}\l \\ \\ }$$
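The index formula is easy to check numerically. The following is a minimal NumPy sketch, not a definitive implementation: the function names `normalize`, `analytic_jacobian` and `numerical_jacobian`, the $3 \times 4$ test size, and the step sizes are all assumptions made for illustration. It builds $\lambda^{-1}(\mathcal{E} - Y \star Y)$ with `einsum`, compares it entry by entry with central finite differences, and checks that contracting it with a small $dX$ predicts the change in $Y$.

```python
import numpy as np

def normalize(X):
    # Y = X / ||X||_F (np.linalg.norm defaults to the Frobenius norm for 2-D input)
    return X / np.linalg.norm(X)

def analytic_jacobian(X):
    m, n = X.shape
    lam = np.linalg.norm(X)
    Y = X / lam
    # Identity tensor: E_ijkl = delta_ik * delta_jl
    E = np.einsum("ik,jl->ijkl", np.eye(m), np.eye(n))
    # Dyadic product: (Y * Y)_ijkl = Y_ij * Y_kl
    YY = np.einsum("ij,kl->ijkl", Y, Y)
    return (E - YY) / lam

def numerical_jacobian(func, X, eps=1e-6):
    # Central finite differences: J[:, :, k, l] = dY / dX_kl
    m, n = X.shape
    J = np.zeros((m, n, m, n))
    for k in range(m):
        for l in range(n):
            step = np.zeros_like(X)
            step[k, l] = eps
            J[:, :, k, l] = (func(X + step) - func(X - step)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))  # arbitrary non-square test matrix

J = analytic_jacobian(X)
jac_err = np.max(np.abs(J - numerical_jacobian(normalize, X)))

# First-order check: dY is approximately J : dX for a small perturbation dX
dX = 1e-6 * rng.standard_normal(X.shape)
dY_pred = np.einsum("ijkl,kl->ij", J, dX)
dY_true = normalize(X + dX) - normalize(X)
lin_err = np.max(np.abs(dY_pred - dY_true))

print("shape:", J.shape)
print("max finite-difference error:", jac_err)
print("max linearization error:", lin_err)
```

The resulting array has shape `(m, n, m, n)`, which makes concrete why no $m \times n$ matrix can represent this derivative.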


Another approach is to flatten the matrices into vectors $$\eqalign{ y &= \vc Y,\quad x = \vc X,\quad \l = \|x\|_F \\ \grad{y}{x} &= \frac{I-yy^T}\l \qif \grad{y_i}{x_k} = \frac{\d_{ik}-y_iy_k}\l \\ }$$
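The flattened form can be checked the same way. This is a sketch under stated assumptions: the helper `normalize_vec`, the random $3 \times 4$ test matrix, and the column-major convention for $\operatorname{vec}$ (`order="F"`) are illustrative choices. It also checks two properties that follow directly from the formula: the Jacobian is symmetric, and $x$ lies in its null space, since rescaling $x$ does not change $y$.

```python
import numpy as np

def normalize_vec(x):
    # For a vector, the Frobenius norm is the ordinary 2-norm
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))   # arbitrary non-square test matrix
x = X.reshape(-1, order="F")      # vec(X), stacking columns
lam = np.linalg.norm(x)
y = x / lam

# Analytic Jacobian (I - y y^T) / lambda, an (mn) x (mn) matrix
J_vec = (np.eye(x.size) - np.outer(y, y)) / lam

# Central finite differences, one column of the Jacobian per entry of x
eps = 1e-6
J_num = np.zeros_like(J_vec)
for k in range(x.size):
    step = np.zeros_like(x)
    step[k] = eps
    J_num[:, k] = (normalize_vec(x + step) - normalize_vec(x - step)) / (2 * eps)

vec_err = np.max(np.abs(J_vec - J_num))   # agreement with finite differences
null_err = np.max(np.abs(J_vec @ x))      # J x should vanish
is_symmetric = np.allclose(J_vec, J_vec.T)

print("max finite-difference error:", vec_err)
print("max |J x|:", null_err)
print("symmetric:", is_symmetric)
```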
