Solved – Reinforcement Learning – What is the logic behind actor-critic methods? Why use a critic?

actor-critic, machine learning, neural networks, policy gradient, reinforcement learning

Following David Silver's course, I came across the actor-critic policy improvement algorithm family.

For one-step Markov decision processes, it holds that

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s,a)\,r]$$

where $J$ is the MDP's value function, $\pi_\theta$ is the policy parameterized by $\theta$, and $r$ is the reward sampled after taking action $a$ in state $s$.
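(If it helps to see where the $\log$ comes from: by the likelihood-ratio identity $\nabla_{\theta}\pi_{\theta}(s,a) = \pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)$, and writing $d(s)$ for the state distribution,

$$\nabla_{\theta}J(\theta) = \sum_{s}d(s)\sum_{a}\nabla_{\theta}\pi_{\theta}(s,a)\,r = \sum_{s}d(s)\sum_{a}\pi_{\theta}(s,a)\,\nabla_{\theta}\log\pi_{\theta}(s,a)\,r = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)\,r].)$$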

It also holds (this is the policy gradient theorem) that, for several choices of the value function $J$, the policy gradient is

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(s,a)\,Q^{\pi_{\theta}}(s,a)]$$

In the lecture (at 1:06:35 onwards) David says: "And the actor moves in the direction suggested by the critic."

I am pretty sure that by this he means "the actor's weights are then updated in direct relation to the critic's criticism":

$$\theta = \theta + \alpha \nabla_{\theta}\log\pi_{\theta}(s,a)\,Q_{w}(s,a)$$

where $\alpha$ is the learning rate, $\pi_{\theta}$ is the actor's policy parameterized by $\theta$, and $Q_w$ is the critic's evaluation function, parameterized by $w$.
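In code, I picture that update roughly like this (a minimal PyTorch-style sketch; the network sizes and names are placeholders I made up, not from the lecture):

```python
import torch
import torch.nn as nn

# Toy discrete setting: 4-dimensional states, 2 actions.
# actor produces action logits (pi_theta); critic produces one Q estimate per action (Q_w).
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

alpha = 1e-3
actor_opt = torch.optim.SGD(actor.parameters(), lr=alpha)

def actor_update(state, action):
    """One step of theta <- theta + alpha * grad_theta log pi_theta(s, a) * Q_w(s, a)."""
    log_prob = torch.distributions.Categorical(logits=actor(state)).log_prob(action)
    with torch.no_grad():            # the critic's output is treated as a constant weight
        q = critic(state)[action]
    loss = -(log_prob * q)           # minimising the negative performs gradient ascent on J
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

# Example call with dummy data:
actor_update(torch.randn(4), torch.tensor(1))
```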

So far so good.

What I am not getting (basically many aspects of the same question):

Why do we need a critic at all?

I just can't see where the critic suddenly came from and what it solves.

What is the gradient of the policy $\pi$ itself, if not "the direction of improvement"? Why add the critic?

Why not use the same parameters for the actor and the critic? It seems to me they actually approximate the same thing: "how good is choosing action $a$ from state $s$".

Why did we replace $Q^{\pi_{\theta}}(s,a)$ with an approximation $Q_w(s,a)$ that uses different parameters $w$? What benefit does this separation introduce?

Best Answer

Why do we need a critic at all?

I just can't see where the critic suddenly came from and what it solves.

The critic solves the problem of high variance in the reward signal. If you run the same (likely stochastic) policy over and over in an (also probably stochastic) environment, you will get a different amount of cumulative reward every time. The critic, meanwhile, gives a (hopefully good) estimate of that cumulative reward with no variance between rollouts.
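As a toy illustration of that variance (everything here is made up, just to show the shape of the problem): the Monte-Carlo weight on $\nabla_\theta \log\pi_\theta(s,a)$ changes on every rollout, while a critic returns the same (possibly biased) estimate each time:

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_return(n_steps=50):
    """Stand-in for one rollout's return from a fixed (s, a):
    stochastic policy + stochastic environment => a different number every time."""
    return sum(rng.normal(loc=1.0, scale=2.0) for _ in range(n_steps))

# Five rollouts from the same (s, a) give five noticeably different gradient weights.
print([round(monte_carlo_return(), 1) for _ in range(5)])

# A trained critic instead returns a single estimate (possibly biased, but with zero
# sampling variance), so the policy-gradient estimate it produces is far less noisy.
q_w = 50.0  # stand-in for Q_w(s, a)
print([q_w] * 5)
```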

Why not use the same parameters for the actor and the critic? It seems to me they actually approximate the same thing: "how good is choosing action $a$ from state $s$".

You can certainly share some of the parameters between the actor's and the critic's networks. However, you can't literally use the same parameters all the way through, because the two networks compute different things: the actor outputs a distribution over actions, while the critic outputs a value estimate.
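A common way to share parameters (just one possible layout, sketched here with made-up sizes) is a shared feature trunk with two separate heads:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared feature trunk with separate actor and critic heads."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())  # shared parameters
        self.actor_head = nn.Linear(hidden, n_actions)    # action logits -> pi_theta(s, .)
        self.critic_head = nn.Linear(hidden, n_actions)   # one Q estimate per action -> Q_w(s, .)

    def forward(self, state):
        features = self.trunk(state)
        return self.actor_head(features), self.critic_head(features)

logits, q_values = ActorCritic()(torch.randn(4))
```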

Why did we replace $Q^{\pi_{\theta}}(s,a)$ with an approximation $Q_w(s,a)$ that uses different parameters $w$? What benefit does this separation introduce?

$Q^{\pi_{\theta}}(s,a)$ is the true action-value function of the policy $\pi_\theta$. There is no straightforward way to compute it for free; you would have to estimate it by actually rolling out the policy. $Q_w(s,a)$ is our learned approximation of that function, which we can evaluate cheaply for any state-action pair.
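One common way to learn that approximation alongside the actor (a hedged sketch, not the only option) is a SARSA-style temporal-difference update on $w$:

```python
import torch
import torch.nn as nn

# Hypothetical critic for a 4-dimensional state space with 2 actions.
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
critic_opt = torch.optim.SGD(critic.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

def critic_update(s, a, r, s_next, a_next):
    """One TD(0) step: move Q_w(s, a) towards the target r + gamma * Q_w(s', a')."""
    with torch.no_grad():
        td_target = r + gamma * critic(s_next)[a_next]
    td_error = td_target - critic(s)[a]
    loss = td_error.pow(2)           # squared TD error
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()

# Example call with dummy transition data:
critic_update(torch.randn(4), torch.tensor(0), 1.0, torch.randn(4), torch.tensor(1))
```

In practice the actor update from the question and a critic update like this are interleaved: the critic tracks the current policy's values, and the actor climbs the gradient those values define.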
