Solved – Computing the Actor Gradient Update in the Deep Deterministic Policy Gradient (DDPG) algorithm

deep learning, machine learning, neural networks, reinforcement learning

This question concerns the DeepMind paper on DDPG: https://arxiv.org/pdf/1509.02971v5.pdf.

Most (all?) implementations of the DDPG algorithm that I've seen compute the gradient update to the actor network as $\nabla J = \nabla_{\mu(s|\theta)} Q(s, \mu(s|\theta)) \, \nabla_{\theta} \mu(s|\theta)$, where $\theta$ represents the actor network's parameters, $\mu$ represents the actor network, $Q$ represents the critic network, and $s$ represents the state input. I'll call this equation 1.

Equation 1, as shown in the paper, is derived by applying the chain rule to $\nabla J = \nabla_{\theta} Q(s, \mu(s|\theta))$. This gives $\nabla_{\mu(s|\theta)} Q(s, \mu(s|\theta)) \, \nabla_{\theta} \mu(s|\theta)$.

My question is: using an autograd software package (Theano/TensorFlow/Torch/etc.), is there any reason why I couldn't just compute the gradient of the output of $Q$ wrt $\theta$ directly? For some reason, all implementations seem to first compute the gradient of the output of $Q$ wrt $\mu(s)$ and then multiply it by the gradient of $\mu(s)$ wrt $\theta$, per the chain rule. I don't understand why they do this; why not just directly compute the gradient of $Q$ wrt $\theta$ instead? Is there a reason you cannot do this?
I.e., why do most implementations seem to do this:

Q_grad = gradients( Q(s, mu(s|theta)), mu(s|theta) )
mu_grad = gradients( mu(s|theta), theta )
J_grad = Q_grad * mu_grad

Instead of this:

J_grad = gradients( Q(s, mu(s|theta)), theta )

Here, the first input to "gradients" is the expression you want to differentiate and the second input is what you are differentiating with respect to.
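
To make the comparison concrete, here is a minimal PyTorch sketch of the two variants. The toy actor/critic networks, dimensions, and variable names are hypothetical (not taken from the paper or any particular implementation); it just computes the actor gradient both ways and checks that the results agree.

import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim = 3, 2

# Hypothetical toy networks standing in for mu(s|theta) and Q(s, a).
actor = nn.Sequential(nn.Linear(state_dim, 16), nn.Tanh(), nn.Linear(16, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 16), nn.Tanh(), nn.Linear(16, 1))
theta = list(actor.parameters())

s = torch.randn(4, state_dim)                   # small hypothetical batch of states
a = actor(s)                                    # mu(s|theta)
q = critic(torch.cat([s, a], dim=-1)).sum()     # Q(s, mu(s|theta)) summed over the batch

# Variant 1: compute grad_a Q, then push it back through mu via the chain rule.
dq_da = torch.autograd.grad(q, a, retain_graph=True)[0]
grads_chain = torch.autograd.grad(a, theta, grad_outputs=dq_da, retain_graph=True)

# Variant 2: differentiate Q(s, mu(s|theta)) with respect to theta directly.
grads_direct = torch.autograd.grad(q, theta)

for g1, g2 in zip(grads_chain, grads_direct):
    assert torch.allclose(g1, g2, atol=1e-6)    # the two updates coincide
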

To be clear, I see no reason why $\nabla J = \nabla_{\theta} Q$ is a different update than equation 1, seeing as equation 1 is literally derived by applying the chain rule to $\nabla_{\theta} Q$, but I want to make sure I'm not missing some kind of subtlety.

Best Answer

There is no difference in the calculation. I was wondering the same thing and verified this in my own TensorFlow DDPG implementation by trying both and asserting that the numerical values are identical. As expected, they are.

I noticed that most tutorial-like implementations (e.g. Patrick Emami's) explicitly show the multiplication. However, OpenAI's baselines implementation does directly compute $\nabla_{\theta^\mu} Q$. (They do this by defining a loss on the actor network equal to $-Q$, averaged across the batch, so that minimizing the loss follows $-\nabla_{\theta^\mu} Q$.)
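
For illustration, here is a minimal sketch of that "define a loss and let autograd do the rest" pattern in PyTorch (the baselines code itself is TensorFlow; the toy networks and minibatch here are hypothetical):

import torch
import torch.nn as nn

state_dim, action_dim = 3, 2
actor = nn.Sequential(nn.Linear(state_dim, 16), nn.Tanh(), nn.Linear(16, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 16), nn.Tanh(), nn.Linear(16, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

states = torch.randn(64, state_dim)   # hypothetical minibatch, e.g. sampled from a replay buffer

# The actor loss is -Q averaged over the batch; its gradient wrt theta is
# -grad_theta Q, so a standard minimizing optimizer performs ascent on Q.
actor_loss = -critic(torch.cat([states, actor(states)], dim=-1)).mean()
actor_opt.zero_grad()
actor_loss.backward()   # gradients also land in the critic's parameters, but only the actor is stepped
actor_opt.step()
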

There is one reason you might want to separate out $\nabla_a Q$ from $\nabla_{\theta^\mu} \mu$ and multiply them yourself: when you want to manipulate one of the factors directly. For example, Hausknecht and Stone apply "inverting gradients" to $\nabla_a Q$ to coerce actions to stay within the environment's valid range.
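
As a rough sketch of why the separated form is convenient, here is one way to modify $\nabla_a Q$ before backpropagating it through $\mu$. The toy networks and action bounds are hypothetical, and the rescaling rule is only an approximation of the "inverting gradients" idea described in Hausknecht and Stone's paper:

import torch
import torch.nn as nn

state_dim, action_dim = 3, 2
a_min, a_max = -1.0, 1.0                 # hypothetical action bounds
actor = nn.Sequential(nn.Linear(state_dim, 16), nn.Tanh(), nn.Linear(16, action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 16), nn.Tanh(), nn.Linear(16, 1))

s = torch.randn(32, state_dim)
a = actor(s)
q = critic(torch.cat([s, a], dim=-1)).sum()

dq_da = torch.autograd.grad(q, a, retain_graph=True)[0]   # grad_a Q, available for manipulation

# Rescale each component of grad_a Q by how much room the action has left toward
# the bound it is being pushed toward, discouraging actions from leaving [a_min, a_max].
room_up = (a_max - a.detach()) / (a_max - a_min)     # used where the gradient pushes the action up
room_down = (a.detach() - a_min) / (a_max - a_min)   # used where the gradient pushes the action down
dq_da = torch.where(dq_da > 0, dq_da * room_up, dq_da * room_down)

# Push the modified grad_a Q back through mu to get the gradient wrt the actor's parameters.
grads = torch.autograd.grad(a, list(actor.parameters()), grad_outputs=dq_da)
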
