Solved – Computing Empirical Fisher Information matrix for natural gradient

fisher-information, policy-gradient, reinforcement-learning

I would like to implement the natural gradient for reinforcement learning as described in the following paper: https://arxiv.org/pdf/1703.02660.pdf

However, I do not know how to compute the empirical Fisher information matrix needed to implement gradient ascent with the parameter update $\theta_{t+1} := \theta_t + F^{-1}\nabla_\theta J(\pi_\theta)$, where $\nabla_\theta J(\pi_\theta)$ is the regular policy gradient weighted by the advantages.

When I compute the empirical Fisher information as the average of the outer products of the log-policy gradients, $F = \frac{1}{T}\sum_{t=1}^T \nabla_{\theta} \log \pi_\theta(a_t \mid s_t)\, \nabla_{\theta} \log \pi_\theta(a_t \mid s_t)^\intercal$ (taken over all trajectories/samples), the resulting Fisher matrix is not positive semidefinite.

I do not see any reason why the log-policy gradient $\nabla_{\theta} \log \pi_\theta$ should not have negative components.

What is the right way to compute the empirical Fisher information from the gradients in practice? Is it correct to directly use the outer product of the gradients (e.g., with numpy, F = np.outer(grad, grad))?
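For concreteness, here is a minimal numpy sketch of the estimator above and the corresponding update, assuming a hypothetical array grad_log_probs whose rows are the flattened per-timestep gradients $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$; the step size and damping values are illustrative, not taken from the paper:

```python
import numpy as np

def empirical_fisher(grad_log_probs):
    """Average of the per-sample outer products g g^T.

    grad_log_probs: array of shape [T, n_params], one flattened
    gradient of log pi_theta(a_t | s_t) per row.
    """
    G = np.asarray(grad_log_probs)
    return G.T @ G / G.shape[0]  # same as averaging np.outer(g, g) over the rows

def natural_gradient_step(theta, policy_grad, grad_log_probs,
                          step_size=1.0, damping=1e-3):
    """One update theta <- theta + step_size * F^{-1} grad J."""
    F = empirical_fisher(grad_log_probs)
    # F is positive semidefinite by construction; the (illustrative) damping
    # term makes it positive definite so the linear solve is well posed.
    nat_grad = np.linalg.solve(F + damping * np.eye(F.shape[0]), policy_grad)
    return theta + step_size * nat_grad
```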

Best Answer

I found an answer to my question here: http://www.telesens.co/2018/06/09/efficiently-computing-the-fisher-vector-product-in-trpo/

The second derivative (Hessian) of the KL divergence can be used to obtain the Fisher matrix, but there are better ways to approximate the natural gradient, such as TRPO, which only computes Fisher-vector products and solves for $F^{-1}\nabla_\theta J$ with conjugate gradient instead of forming the full matrix.
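For reference, a minimal PyTorch-style sketch of that trick, in the spirit of the linked post: the Fisher-vector product is obtained as the Hessian-vector product of the KL divergence between the current policy and a detached copy of itself, so the full matrix is never built. The policy is assumed here to be a categorical policy network mapping states to action logits; the names (policy, states, v, damping) are illustrative.

```python
import torch

def fisher_vector_product(policy, states, v, damping=1e-2):
    """Compute (F + damping * I) @ v without building F explicitly.

    policy: network mapping a batch of states to action logits.
    v:      flat vector with one entry per policy parameter.
    """
    logits = policy(states)
    log_p = torch.log_softmax(logits, dim=-1)
    p_old = torch.softmax(logits, dim=-1).detach()  # "old" policy, held fixed
    # KL(pi_old || pi_theta); its Hessian w.r.t. theta, evaluated at the
    # current parameters, is the Fisher matrix.
    kl = (p_old * (p_old.log() - log_p)).sum(dim=-1).mean()

    params = [p for p in policy.parameters() if p.requires_grad]
    grads = torch.autograd.grad(kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hvp = torch.autograd.grad((flat_grad * v).sum(), params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * v  # damping keeps the implicit matrix well conditioned
```

Plugging this into a conjugate-gradient solver for $Fx = \nabla_\theta J(\pi_\theta)$ yields the natural gradient direction from a handful of such products, which is how TRPO-style implementations avoid forming or inverting $F$.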
