Is Proximal Policy Optimization (PPO) an on-policy reinforcement learning algorithm?

machine learning, reinforcement learning

If PPO is actually an on-policy algorithm, is it true that TRPO and A3C are also on-policy algorithms?

Best Answer

A3C is an actor-critic method, and actor-critic methods tend to be on-policy (A3C included), because the actor's gradient is computed as an expectation over trajectories sampled from the same policy that is being updated.
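
Concretely, the policy-gradient estimator that the actor uses has the form

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right],$$

where the expectation is over trajectories $\tau$ drawn from the current policy $\pi_\theta$ itself. Plugging in samples from a different policy would bias the estimate, which is exactly what makes the method on-policy.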

TRPO and PPO are both on-policy as well. Essentially, they optimize a local (first-order) approximation of the expected return while carefully ensuring that the approximation does not deviate too far from the true objective: TRPO via an explicit trust-region constraint, PPO via a clipped surrogate. This requires frequently sampling fresh rollouts from the current policy, so that the approximation remains valid in a local region around the current parameters $\theta$.
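
To make that concrete, here is a minimal sketch of PPO's clipped surrogate objective in PyTorch (the names `log_probs_new`, `log_probs_old`, `advantages`, and the clip range `eps` are illustrative). The key point is that `log_probs_old` must come from rollouts just sampled with the current policy, which is what keeps the optimization on-policy:

```python
import torch

def ppo_clipped_loss(log_probs_new: torch.Tensor,
                     log_probs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective from the PPO paper, written as a loss.

    log_probs_old must come from the (very recent) policy snapshot that
    generated the rollouts; the approximation is only trusted while the
    new policy stays close to it.
    """
    # Importance ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(log_probs_new - log_probs_old.detach())
    # Unclipped surrogate and its clipped counterpart
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the pessimistic minimum of the two;
    # negate so it can be minimized with gradient descent.
    return -torch.min(unclipped, clipped).mean()
```

The clipping is what enforces "does not deviate too far" without TRPO's explicit constraint: once the ratio leaves $[1-\epsilon,\, 1+\epsilon]$, the gradient through the clipped term vanishes.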

To be very pedantic, you could call this off-policy in the sense that we are approximating the expected return of the current policy $\pi_{\theta}$ with rollouts sampled from a very slightly older $\pi_{\theta_\text{old}}$, but that is not off-policy in the conventional sense, where the behavior policy may differ arbitrarily from the policy being learned (as in Q-learning).
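
Written out, the surrogate being optimized is

$$L(\theta) = \mathbb{E}_{(s,a) \sim \pi_{\theta_\text{old}}}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)}\, A^{\pi_{\theta_\text{old}}}(s, a)\right],$$

so there is an importance-sampling correction for the small gap between $\pi_\theta$ and $\pi_{\theta_\text{old}}$, but it is only trusted while the two policies remain close, unlike genuinely off-policy methods that can reuse arbitrarily stale data.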
