[Math] Derivative of Squared Frobenius Norm

Let's say I have an $n \times n$ matrix $a$ and I need to compute
$$
\frac{\partial}{\partial a} h(a) =
\frac{\partial}{\partial a} \left\|a\right\|_\text{fro}^2 =
\frac{\partial}{\partial a} \sum_{i,j} \left(a_{i,j}\right)^2 \text.
$$
I'm getting a result that cannot be right. What am I doing wrong? I start by rewriting the above term using $J_i$ to denote a column vector with $\left(J_i\right)_k = 0\ \forall i \neq k$ and $\left(J_i\right)_k = 1$ for $i = k$:
$$
= \sum_{i,j} \frac{\partial}{\partial a} \left(J_i^\mathrm T a J_j\right)^2
$$
Now I apply the chain rule:
$$
= \sum_{i,j} \frac{\partial}{\partial a} u(v(a))
= \sum_{i,j} u'(v(a))\ v'(a)
= \sum_{i,j} \underbrace{2 J_i^\mathrm T a J_j}_{u'(v(a))}\ \underbrace{J_i^\mathrm T J_j}_{v'(a)}
$$
And this obviously resolves to:
$$
= \sum_{i,j} 2 J_i^\mathrm T a J_j \left[i=j\right]
= \sum_{i} 2 J_i^\mathrm T a J_i
= \sum_{i} 2 a_{i,i}
= 2\ \mathrm{tr}\left(a\right)
$$
Now I think that this cannot be right because this result implies that changing a non-diagonal element of $a$ won't change the value $h(a)$. But this contradicts the definition of $h$ above.
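
As a sanity check, here is a minimal finite-difference sketch in plain Python. The $2 \times 2$ matrix, the step size `eps`, and the helper name `finite_difference_gradient` are arbitrary illustrative choices, not part of the problem. It perturbs each entry of $a$ in turn and records how $h$ responds:

```python
def h(a):
    """Squared Frobenius norm: the sum of the squares of all entries."""
    return sum(x * x for row in a for x in row)


def finite_difference_gradient(a, eps=1e-6):
    """Approximate dh/da_{i,j} for every entry using central differences."""
    n = len(a)
    grad = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Copy the matrix twice and nudge only entry (i, j).
            plus = [row[:] for row in a]
            minus = [row[:] for row in a]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (h(plus) - h(minus)) / (2 * eps)
    return grad


# Arbitrary example matrix with nonzero off-diagonal entries.
a = [[1.0, 2.0], [3.0, 4.0]]
grad = finite_difference_gradient(a)

for row in grad:
    print(row)
```

The off-diagonal entries of `grad` come out clearly nonzero, so $h$ does react to changes in off-diagonal elements, which a single scalar such as $2\ \mathrm{tr}(a)$ cannot capture.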

Where's my mistake?

Best Answer

What you are doing wrong is assuming that the scalar "product rule" and "chain rule" carry over unchanged to matrix differentiation in the way you are using them; the article here states that they do not.

There is a "product rule" and "chain rule" that work in this context. However, understanding them requires that you acknowledge that the derivative of $h(a)$ is not simply a scalar-valued function on matrices; rather, at each $a$, $h'(a)$ is a linear functional on matrices, which can be represented nicely as a matrix with the correct choice of dual basis.
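
As a sketch of how this plays out for $h$: perturb $a$ by $\varepsilon b$ for an arbitrary matrix $b$ and collect the term that is linear in $\varepsilon$. The pairing $\langle x, y \rangle = \mathrm{tr}\left(x^\mathrm T y\right)$ used below to turn the functional into a matrix is an assumed convention, not something fixed by the question.
$$
h(a + \varepsilon b)
= \sum_{i,j} \left(a_{i,j} + \varepsilon b_{i,j}\right)^2
= h(a) + 2 \varepsilon \sum_{i,j} a_{i,j} b_{i,j} + \varepsilon^2 h(b)
$$
So the derivative at $a$ is the linear functional
$$
b \mapsto 2 \sum_{i,j} a_{i,j} b_{i,j} = 2\ \mathrm{tr}\left(a^\mathrm T b\right) \text{,}
$$
which, under that pairing, is represented by the matrix $2a$:
$$
\frac{\partial}{\partial a} \left\|a\right\|_\text{fro}^2 = 2a
$$
In the notation of the question, the step that breaks is $v'(a)$. For $v(a) = J_i^\mathrm T a J_j = a_{i,j}$, the derivative with respect to $a$ is the matrix $J_i J_j^\mathrm T$ (a single $1$ in position $(i,j)$), not the scalar $J_i^\mathrm T J_j$. With that correction the sum becomes
$$
\sum_{i,j} 2 a_{i,j}\ J_i J_j^\mathrm T = 2a \text{,}
$$
which agrees with the result above and does depend on the off-diagonal entries.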
