Solved – why is PPO able to train for multiple epochs on the same minibatch

reinforcement learning

In the PPO paper, the authors say this about the objective function:

a novel objective function that enables multiple epochs of minibatch updates

But what is it about PPO's objective that allows this?

The PPO objective is as follows:

ratio $r = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}$

loss $L = \min\big(r \cdot A,\ \operatorname{clip}(r,\, 1 - \epsilon,\, 1 + \epsilon) \cdot A\big)$, where $A$ is the advantage estimate
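
For concreteness, here is a minimal sketch of this loss in PyTorch (the function and argument names are illustrative, not from the question; $\epsilon = 0.2$ is the paper's default):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, epsilon=0.2):
    """Clipped surrogate objective, negated so it can be minimized."""
    # r = pi_theta(a|s) / pi_theta_old(a|s), computed from log-probabilities
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Take the pessimistic (lower) of the two terms and average over the minibatch
    return -torch.min(unclipped, clipped).mean()
```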

After the first minibatch update is applied, we are optimizing with rollouts collected from a policy that no longer matches the current one, and this off-policy training seems like it could cause problems (deadly triad).

Best Answer

The raw objective being optimized in both TRPO and PPO (i.e., the objective without the penalty term or the clipping) is a first-order approximation of the true policy objective: its gradient at the old policy parameters is exactly the policy gradient.
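
To make that concrete (a standard derivation, not spelled out in the original answer): the gradient of the unclipped surrogate at $\theta = \theta_{old}$ is

$$\nabla_\theta \,\mathbb{E}\!\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\,A\right]\Bigg|_{\theta = \theta_{old}} = \mathbb{E}\!\left[\frac{\nabla_\theta \pi_\theta(a|s)\big|_{\theta_{old}}}{\pi_{\theta_{old}}(a|s)}\,A\right] = \mathbb{E}\!\left[\nabla_\theta \log \pi_\theta(a|s)\big|_{\theta_{old}}\,A\right],$$

which is exactly the policy gradient estimator, so the surrogate agrees with the true objective to first order only in a neighborhood of $\theta_{old}$.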

If we simply optimized the policy for many iterations without sampling new data, the optimization would fail, because (1) the first-order approximation would break down as the policy moves away from the one that collected the data, and (2) the expectation in the objective would be taken over a distribution significantly different from the one we are actually sampling from.

However, since TRPO penalizes the KL divergence between the old and new policies, and PPO clips the probability ratio to remove the incentive to move too far from the old policy, both (1) and (2) are avoided.
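
A quick way to see the clipping at work (a toy check with made-up numbers, using the same loss as the sketch above): once the ratio has already moved past $1 + \epsilon$ in the direction a positive advantage favors, the $\min$ selects the clipped, constant branch and the gradient vanishes, so additional epochs on the same data stop pushing the policy further away.

```python
import torch

eps, advantage = 0.2, torch.tensor(1.0)   # positive advantage: the objective wants the ratio to grow
log_prob_old = torch.tensor(0.0)

for new_lp in (0.1, 0.5):                 # ratio ~1.11 (inside the clip range) vs ~1.65 (outside)
    log_prob_new = torch.tensor(new_lp, requires_grad=True)
    ratio = torch.exp(log_prob_new - log_prob_old)
    loss = -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - eps, 1 + eps) * advantage)
    loss.backward()
    print(f"ratio={ratio.item():.2f}  grad wrt log_prob_new={log_prob_new.grad.item():.2f}")

# Inside the clip range the gradient is nonzero; outside it the clipped branch wins
# and the gradient is zero, so the policy is no longer pushed away from pi_theta_old.
```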
