Why do we clip the surrogate objective in PPO?

deep learning, reinforcement learning

I'm trying to understand the justification behind clipping in Proximal Policy Optimization (PPO).

In the paper "Proximal Policy Optimization Algorithms" (by John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov), on page 3, equation (7) gives the following objective function:

$$L^\text{CLIP}(\theta) = \hat{\mathbb{E}}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t)]$$

where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$, $\pi$ denotes a policy and $\hat{A}_t$ is an advantage estimator.
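For concreteness, here is a minimal sketch of this objective in PyTorch (the function name, argument names, and the default $\epsilon = 0.2$ are chosen for illustration; this is not the paper's reference code):

```python
import torch

def clipped_surrogate(log_prob_new, log_prob_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP of eq. (7), averaged over a batch."""
    # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s),
    # computed in log space for numerical stability.
    ratio = torch.exp(log_prob_new - log_prob_old)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages

    # L^CLIP is maximized; negate the result if your optimizer minimizes a loss.
    return torch.min(unclipped, clipped).mean()
```

In practice `log_prob_old` comes from the policy that collected the data and carries no gradient, so only $\pi_\theta$ is updated.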

The paper justifies the clipping as follows:

"removes the incentive for moving $r_t$ outside of the interval $[1 - \epsilon, 1 + \epsilon]$."

It seems to me that clipping would cause gradients to be $0$. Isn't this effectively throwing the samples away?

Best Answer

This doesn't set the gradient of the whole objective to zero; we just bound it. It's like bounding the loss of an objective function we are minimizing so that the gradient updates stay small.

Here, clipping makes sure that the increase in the probability of a "good" action at a state, $\pi(a \mid s)$, is limited: the ratio is not allowed to grow by more than $\epsilon$ in the positive direction, hence $r_t(\theta)$ is capped at $1+\epsilon$. If the change is in the negative direction (meaning that even though it is a good action with $\hat{A}_t > 0$, its probability is decreasing), we let it move freely.

Similarly, if the ratio of a "bad" action ($\hat{A}_t < 0$) is falling too far, we clip it at $1-\epsilon$, saying we don't want it to fall further than that; we let it take any value above that.

Basically, the update of the objective will try to increase the probability of a good action (increasing the ratio) and decrease the probability of a bad action (decreasing the ratio). We are saying that we don't want the probability ratio of any action to move away from $1$ by more than $\epsilon$.
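To make this case analysis concrete, here is a small numerical sketch (assuming PyTorch, with made-up ratio and advantage values) showing which of the two terms the $\min$ selects in each case and the resulting gradient with respect to the ratio:

```python
import torch

epsilon = 0.2

# (ratio, advantage) pairs covering the four cases discussed above.
cases = [
    (1.5, +1.0),  # good action, ratio already pushed above 1 + eps
    (0.5, +1.0),  # good action, ratio has dropped
    (0.5, -1.0),  # bad action, ratio already pushed below 1 - eps
    (1.5, -1.0),  # bad action, ratio has gone up
]

for r, A in cases:
    ratio = torch.tensor(r, requires_grad=True)
    adv = torch.tensor(A)
    objective = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * adv)
    objective.backward()
    print(f"r={r:.1f}, A={A:+.1f} -> objective={objective.item():+.2f}, "
          f"d(objective)/dr={ratio.grad.item():+.1f}")
```

Only in the two "runaway" cases ($\hat{A}_t > 0$ with $r_t > 1+\epsilon$, and $\hat{A}_t < 0$ with $r_t < 1-\epsilon$) does the clipped term win the $\min$, so the incentive to push the ratio further is removed; in the other two cases the unclipped term is used and the sample still contributes to the update.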
