Solved – Actor-critic loss function in reinforcement learning

actor-critic, machine learning, reinforcement learning

In actor-critic learning for reinforcement learning, I understand that you have an "actor" that decides which action to take and a "critic" that then evaluates those actions. However, I'm confused about what the loss function is actually telling me.

Sutton and Barto describe the algorithm in their book on page 274 (292 of the PDF), found here: http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf

I can understand that you want to update the actor by incorporating information about the state value (determined by the critic). This is done through the value of $\delta$, which incorporates said information, but I don't quite understand why the algorithm is looking at the gradient of the state-value function.

Shouldn't I be looking at the gradient of some objective function that I'm looking to minimize? Earlier in the chapter he states that we can regard performance of the policy simply as its value function, in which case is all we are doing just adjusting the parameters in the direction which maximizes the value of each state? I thought that that was supposed to be done by adjusting the policy, not by changing how we evaluate a state.

Thanks

Best Answer

Let's first try to build a solid understanding of what $\delta$ means. Maybe you know all of this, but it's good to go over it anyway in my opinion.

$\delta \gets R + \gamma \hat{v}(S', w) - \hat{v}(S, w)$

Let's start with the $\hat{v}(S, w)$ term. That term is the value of being in state $S$, as estimated by the critic under the current parameterization $w$. This state-value is essentially the discounted sum of all rewards we expect to get from this point onwards.

$\hat{v}(S', w)$ has a very similar meaning, with the only difference being that it's the value of the next state $S'$ instead of the previous state $S$. If we discount this by multiplying by $\gamma$, and add the observed reward $R$ to it, we get the part of the right-hand side of the equation before the minus: $R + \gamma \hat{v}(S', w)$. This has essentially the same meaning as $\hat{v}(S, w)$ (it is an estimate of the value of being in the previous state $S$), but this time it's based on some newly observed information ($R$) and an estimate of the value of the next state, rather than being a pure estimate with no observed data in it.

So, $\delta$ is the difference between two different ways of estimating exactly the same value, with one part (left of the minus) being expected to be a slightly more reliable estimate because it's based on a little bit more information that's known to be correct ($R$).

$\delta$ is positive if the transition from $S$ to $S'$ turned out better than the critic expected, i.e. the observed reward $R$ plus the discounted value of $S'$ exceeds the critic's estimate of $S$, and negative if it turned out worse (both judged under the current parameterization $w$).
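
To make that concrete, here's a minimal sketch of how $\delta$ is computed, assuming (my assumption, not the book's) a simple linear critic $\hat{v}(S, w) = w^\top x(S)$ over some feature vector $x(S)$:

```python
import numpy as np

def v_hat(x_s, w):
    """Critic's estimate of the state value: v_hat(S, w) = w . x(S) (linear critic assumed)."""
    return np.dot(w, x_s)

def td_error(R, x_s, x_s_next, w, gamma=0.99, done=False):
    """delta = R + gamma * v_hat(S', w) - v_hat(S, w).

    The bootstrap target R + gamma * v_hat(S', w) is the 'slightly more
    informed' estimate of the value of S; if the episode ended, there is
    no next state to bootstrap from, so the target is just R.
    """
    target = R if done else R + gamma * v_hat(x_s_next, w)
    return target - v_hat(x_s, w)
```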


Shouldn't I be looking at the gradient of some objective function that I'm looking to minimize? Earlier in the chapter he states that we can regard performance of the policy simply as its value function, in which case is all we are doing just adjusting the parameters in the direction which maximizes the value of each state? I thought that that was supposed to be done by adjusting the policy, not by changing how we evaluate a state.

Yes, this should be done, and this is exactly what is done by the following line:

$\theta \gets \theta + \alpha I \delta \nabla_\theta \log \pi(A \mid S, \theta)$
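
Just as an illustration (the linear softmax parameterization, the feature vector $x(S)$, and the names below are my assumptions, not the book's), that update looks roughly like this for a discrete action space:

```python
import numpy as np

def softmax_policy(theta, x_s):
    """pi(a | S, theta) for a linear softmax policy: one preference theta_a . x(S) per action."""
    prefs = theta @ x_s              # shape (n_actions,)
    prefs = prefs - prefs.max()      # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def actor_update(theta, x_s, action, delta, alpha, I=1.0):
    """theta <- theta + alpha * I * delta * grad_theta log pi(A | S, theta).

    I is the accumulating discount factor from the pseudocode.
    """
    probs = softmax_policy(theta, x_s)
    # For a linear softmax, grad_theta log pi(A|S,theta) is x(S) on the row
    # of the chosen action, minus pi(a|S) * x(S) on every row.
    grad_log_pi = -np.outer(probs, x_s)
    grad_log_pi[action] += x_s
    return theta + alpha * I * delta * grad_log_pi
```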

However, that's not the only thing we want to update.

I can understand that you want to update the actor by incorporating information about the state value (determined by the critic). This is done through the value of δ, which incorporates said information, but I don't quite understand why the algorithm is looking at the gradient of the state-value function.

We ALSO want to do this, because the critic is supposed to always give as good an estimate of the state value as possible. If $\delta$ is nonzero, the critic's estimate was off, so we also want to update the critic to become more accurate.
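
In the pseudocode this is the companion critic update, a semi-gradient TD(0) step of the form $w \gets w + \beta \, \delta \, \nabla_w \hat{v}(S, w)$ (with $\beta$ the critic's step size), and that is why the gradient of the state-value function shows up: it tells us how to nudge the critic's own parameters $w$. For the linear critic sketched above, that gradient is just the feature vector $x(S)$:

```python
def critic_update(w, x_s, delta, beta):
    """Semi-gradient TD(0) step: w <- w + beta * delta * grad_w v_hat(S, w).

    With the linear critic v_hat(S, w) = w . x(S) assumed above, the
    gradient with respect to w is simply the feature vector x(S).
    """
    return w + beta * delta * x_s
```

So the two updates are symmetric in spirit: $\delta$ tells both the actor and the critic in which direction they were wrong, and each one moves its own parameters along its own gradient.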