Matrix Differentiation of Generative Neural Network

Tags: derivatives, matrix-calculus, neural-networks, optimization

I wish to construct a neural network (restricted Boltzmann Machine) with network parameter set $\Omega$ that will minimise the cost function,
\begin{equation}
\mathcal{C}(\Omega) = -\text{Tr}\big[A\log(B_\Omega)\big]
\end{equation}

where $B_\Omega$ is a Hermitian, positive semi-definite matrix generated by the neural network, and $A$ is a fixed Hermitian, positive semi-definite matrix.

Due to the matrix $\log(\cdot)$, this appears very tricky. How do I take derivatives of this cost function with respect to an arbitrary parameter $\Omega_i$ of the network? Is this out of reach analytically, and is automatic differentiation perhaps a better approach?

Best Answer

Differentiating the matrix logarithm is a real challenge, but there are a couple of approaches at your disposal that could be useful. One lemma comes from a paper by Moakher, which gives the derivative of a trace functional of a single-variable matrix function $X(t)$, provided $X$ is invertible with no eigenvalues on the closed negative real line:

Let $X(t)$ be a real matrix-valued function of the real variable $t$. We assume that, for all $t$ in its domain, $X(t)$ is an invertible matrix which does not have eigenvalues on the closed negative real line. Then, $$ \frac{d}{dt} \frac{1}{2}\text{Tr}\left(\text{Log}^2(X(t))\right) \;\;=\;\; \text{Tr}\left(\text{Log}(X(t)) \cdot X^{-1}(t) \cdot \dot{X}(t)\right). $$
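If you want to convince yourself of this numerically, here is a sketch in JAX (assumed tooling, not anything from the question) that compares an autodiff derivative against the right-hand side of the lemma; the symmetric positive-definite path $X(t) = P + tQ$ is an arbitrary choice, made purely for illustration:

```python
import jax
import jax.numpy as jnp

def logm_spd(X):
    """Matrix log of a symmetric positive-definite matrix via its
    eigendecomposition (assumes X = X^T with strictly positive eigenvalues)."""
    w, U = jnp.linalg.eigh(X)
    return (U * jnp.log(w)) @ U.T

# Arbitrary SPD path X(t) = P + t*Q, for illustration only.
M = jax.random.normal(jax.random.PRNGKey(0), (4, 4))
P = M @ M.T + 4.0 * jnp.eye(4)        # fixed SPD base point
Q = jax.random.normal(jax.random.PRNGKey(1), (4, 4))
Q = (Q + Q.T) / 2.0                   # symmetric direction, so dX/dt = Q

X = lambda t: P + t * Q
f = lambda t: 0.5 * jnp.trace(logm_spd(X(t)) @ logm_spd(X(t)))

t = 0.1
lhs = jax.grad(f)(t)                                        # d/dt of (1/2)Tr(Log^2 X)
rhs = jnp.trace(logm_spd(X(t)) @ jnp.linalg.inv(X(t)) @ Q)  # Moakher's expression
print(lhs, rhs)  # the two numbers should agree closely
```

(Differentiating through `eigh` is valid wherever the eigenvalues of $X(t)$ are distinct, which holds almost surely for a random path like this.)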

Unfortunately, this isn't quite the objective function you're working with. Another way around this would be to truncate the logarithm to a finite Taylor expansion. I imagine this would be warranted since you're dealing with an algorithm, and a closed-form expression may not be necessary for your application. Using $$ \text{Log}(X) \;\; =\;\; \sum_{k=1}^\infty \frac{(-1)^{k+1}}{k} (X - I)^k \;\; =\;\; (X-I) - \frac{1}{2}(X-I)^2 + \frac{1}{3}(X-I)^3 - \ldots $$

your objective function can be rewritten as

$$ -\text{Tr}\left (A\text{Log}(B_\Omega)\right ) \;\; =\;\; \sum_{k=1}^\infty \frac{(-1)^k}{k} \text{Tr}\left [A (B_\Omega - I)^k\right ]. $$
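As a quick sanity check of this rewriting, here is a sketch (again assuming JAX) comparing the truncated series against the exact value computed from an eigendecomposition; the diagonal test matrices $A$ and $B$ are illustrative choices, not anything produced by an RBM:

```python
import jax.numpy as jnp

def logm_spd(X):
    """Eigendecomposition-based matrix log (symmetric positive-definite X)."""
    w, U = jnp.linalg.eigh(X)
    return (U * jnp.log(w)) @ U.T

def truncated_cost(A, B, K):
    """-Tr[A Log(B)] approximated by the first K terms of the series above."""
    n = B.shape[0]
    D = B - jnp.eye(n)
    total, Dk = 0.0, jnp.eye(n)
    for k in range(1, K + 1):
        Dk = Dk @ D                                  # now (B - I)^k
        total = total + (-1.0) ** k / k * jnp.trace(A @ Dk)
    return total

# Small test matrices with ||B - I|| < 1, illustrative only.
A = jnp.diag(jnp.array([1.0, 2.0, 3.0]))
B = jnp.diag(jnp.array([0.8, 1.1, 1.3]))

exact = -jnp.trace(A @ logm_spd(B))
print(exact, truncated_cost(A, B, 10), truncated_cost(A, B, 40))
```

The series converges geometrically with ratio $\|B - I\|$ (here $0.3$), so a modest truncation depth already matches the exact value well.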

The one issue here is that the series is only guaranteed to converge if $\|B_\Omega - I\| < 1$ (for Hermitian $B_\Omega$, this means every eigenvalue lies in $(0, 2)$), but for the sake of computation you can truncate this series expansion to whatever precision you want/need and take derivatives with respect to these terms instead.
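As for the automatic-differentiation question: for Hermitian positive-definite $B_\Omega$, the matrix logarithm can be computed exactly through an eigendecomposition, and autodiff frameworks can differentiate through it (JAX's `eigh` has derivatives defined wherever the eigenvalues are distinct), so truncation isn't strictly necessary. A minimal sketch, using a hypothetical parametrization $B_\Omega = WW^\top + \epsilon I$ in place of your actual RBM map (real symmetric shown; the Hermitian case is analogous):

```python
import jax
import jax.numpy as jnp

def logm_spd(X):
    w, U = jnp.linalg.eigh(X)      # differentiable when eigenvalues are distinct
    return (U * jnp.log(w)) @ U.T

def B_of(W):
    """Hypothetical stand-in for the network's map Omega -> B_Omega:
    W W^T + eps*I is symmetric positive definite by construction."""
    return W @ W.T + 1e-3 * jnp.eye(W.shape[0])

def cost(W, A):
    """Exact cost -Tr[A Log(B_Omega)], no series truncation."""
    return -jnp.trace(A @ logm_spd(B_of(W)))

M = jax.random.normal(jax.random.PRNGKey(0), (4, 4))
A = M @ M.T                                   # fixed PSD matrix
W = 0.5 * jax.random.normal(jax.random.PRNGKey(1), (4, 4))

g = jax.grad(cost)(W, A)   # dC/dW_ij for every parameter at once
print(g.shape)             # (4, 4)
```

`jax.grad` returns the gradient with respect to every entry of `W` in one call, which is what you'd feed to a gradient-based optimizer when training the network.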