Solved – How to understand why REINFORCE with baseline is not an actor-critic algorithm

reinforcement learning

I was reading Sutton's RL book and found the following on page 333:

Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated.

The pseudocode for REINFORCE with baseline is:
[pseudocode image: REINFORCE with baseline (episodic), from the book]

And the pseudocode for actor-critic is:

[pseudocode image: one-step actor-critic (episodic), from the book]

Looking at the pseudocode above, how should I understand bootstrapping? REINFORCE with baseline and actor-critic look so similar that it is hard for a beginner to tell them apart.

Best Answer

The difference is in how (and when) the prediction error estimate $\delta$ is calculated.

In REINFORCE with baseline:

$\qquad \delta \leftarrow G - \hat{v}(S_t,\mathbf{w})\qquad$ ; after the episode is complete
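To make the timing concrete, here is a minimal Python sketch of the end-of-episode update. The helper names `v_hat`, `grad_v` and `grad_log_pi` are illustrative assumptions, not taken from the book's pseudocode:

```python
def reinforce_with_baseline_update(episode, w, theta,
                                   alpha_w, alpha_theta, gamma,
                                   v_hat, grad_v, grad_log_pi):
    """One end-of-episode REINFORCE-with-baseline update (sketch).

    episode: list of (S_t, A_t, R_{t+1}) tuples for a completed episode.
    Assumed helpers (hypothetical, for illustration only):
      v_hat(s, w)              -> baseline estimate of the state value
      grad_v(s, w)             -> gradient of v_hat with respect to w
      grad_log_pi(s, a, theta) -> gradient of log pi(a | s, theta)
    """
    T = len(episode)
    for t in range(T):
        S_t, A_t, _ = episode[t]
        # Monte Carlo return: the actual discounted rewards observed from t onward.
        G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))
        # The baseline v_hat(S_t, w) is subtracted, but no successor state's
        # value appears anywhere in the target: no bootstrapping.
        delta = G - v_hat(S_t, w)
        w = w + alpha_w * delta * grad_v(S_t, w)
        theta = theta + alpha_theta * (gamma ** t) * delta * grad_log_pi(S_t, A_t, theta)
    return w, theta
```

Note that the whole episode has to be stored before this function can be called, because every `G` needs the rewards all the way to the terminal state.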

In Actor-critic:

$\qquad \delta \leftarrow R +\gamma \hat{v}(S',\mathbf{w}) - \hat{v}(S,\mathbf{w})\qquad$ ; online
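By contrast, a sketch of the one-step actor-critic update (same assumed helpers as above) applies a change immediately after every transition, using the current estimate of the successor state:

```python
def actor_critic_step(S, A, R, S_next, done, w, theta,
                      alpha_w, alpha_theta, gamma, I,
                      v_hat, grad_v, grad_log_pi):
    """One online actor-critic update after a single transition (sketch).

    I is the accumulated discount (gamma^t in the episodic pseudocode).
    """
    # Bootstrapped target: the learned estimate v_hat(S_next, w) stands in
    # for the rest of the return (taken as 0 if S_next is terminal).
    target = R if done else R + gamma * v_hat(S_next, w)
    delta = target - v_hat(S, w)
    w = w + alpha_w * delta * grad_v(S, w)
    theta = theta + alpha_theta * I * delta * grad_log_pi(S, A, theta)
    return w, theta, I * gamma
```

Here `delta` depends on `w` through $\hat{v}(S',\mathbf{w})$, which is exactly the self-reference the quoted passage calls bootstrapping.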

Bootstrapping in RL is when the learned estimate $\hat{v}$ of a successor state $S'$ is used to construct the update for a preceding state $S$. This kind of self-reference to the estimates learned so far allows for updates at every step, but at the expense of an initial bias towards however the model was initialised. On balance, the faster updates can often lead to more efficient learning; however, the bias can lead to instability.

In REINFORCE, the full return $G$ is used instead, which is the same value you would use in Monte Carlo control. The value of $G$ is not a bootstrap estimate; it is a direct sample of the return observed while behaving with the current policy. As a result it is not biased, but you have to wait until the end of each episode before applying any updates.
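Written out, the Monte Carlo target that REINFORCE uses for time step $t$ of an episode ending at time $T$ is

$\qquad G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$

whereas the bootstrapped target in actor-critic truncates this after one reward and substitutes the current estimate, $R_{t+1} + \gamma \hat{v}(S_{t+1},\mathbf{w})$.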
