In RL, why is it interesting to use a behavior policy instead of the target policy for an episode?

neural-networks, reinforcement-learning

I heard about off-policy methods in RL a few days ago. While I understand the idea, the algorithms and the maths behind them, I'm not too sure why it's interesting to use a different policy to build the action-value function.

In my book, it is said:

An advantage of this separation is that the target policy may be deterministic (e.g., greedy), while the behavior policy can continue to sample all possible actions.

I don't understand what makes this more interesting than an on-policy method that only uses a single policy.

I guess that it makes the exploration of new policies easier, but there must be other reasons, right?

Best Answer

I guess that it makes the exploration of new policies easier, but there must be other reasons, right?

It is essentially only this.

The more usual way of framing it is exploration vs exploitation. Theory tells you that in standard MDPs the optimal policy will be deterministic - each state has one "best" action that you should always take for optimal behaviour (sometimes more than one equivalent action, but in that case you can always pick one of them deterministically).

So you end up with some difficult issues using an on-policy control approach:

  • If you work with a single deterministic policy as your behaviour policy during learning, you will not learn enough about alternative behaviours in order to find better outcomes and improve the policy.

  • If you work with a stochastic policy that covers all possible choices - to guarantee exploration - you know it can never be optimal. Careful manipulation of how the stochastic part varies (such as reducing $\epsilon$ in $\epsilon$-greedy policies) can get you arbitrarily close to a truly optimal policy, but then you are using the same parameter to control both the exploration rate and the closeness to optimality of your learned policy, which means compromising between the best exploration and the best end result (as sketched below).
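
To make that coupling concrete, here is a minimal sketch of on-policy SARSA. It is only an illustration, assuming a small tabular problem with a Gymnasium-style `env.reset()`/`env.step()` interface and a Q-table indexed as `Q[state][action]`; none of these names come from the question. The point is that the same $\epsilon$-greedy policy both generates the actions and appears in the update target, so the values being learned are those of the $\epsilon$-greedy policy itself:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, epsilon, alpha=0.1, gamma=0.99, rng=None):
    """Run one episode of on-policy SARSA.

    The epsilon-greedy policy is used both to act *and* in the update target
    (via the next action actually taken), so the learned Q values are those of
    the epsilon-greedy policy, not of the greedy policy you want in the end.
    """
    rng = rng or np.random.default_rng()
    n_actions = env.action_space.n
    state, _ = env.reset()
    action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(Q, next_state, n_actions, epsilon, rng)
        td_target = reward + (0.0 if done else gamma * Q[next_state][next_action])
        Q[state][action] += alpha * (td_target - Q[state][action])
        state, action = next_state, next_action
    return Q
```

So the only way to make the learned policy approach the optimal one here is to shrink `epsilon`, which at the same time shrinks exploration.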

I don't understand what makes this more interesting than an on-policy method that only uses a single policy.

With off-policy learning, the target policy can be your best guess at the deterministic optimal policy, whilst your behaviour policy can be chosen based mainly on exploration vs exploitation concerns, ignoring to some degree how the exploration rate affects how close to optimal the behaviour itself can get.
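
For contrast, here is a minimal off-policy sketch in the same assumed setting (Gymnasium-style interface, tabular `Q`, reusing the `epsilon_greedy` helper and `numpy` import from the sketch above): Q-learning acts with an $\epsilon$-greedy behaviour policy but bootstraps with the `max` over actions, i.e. the greedy target policy, so $\epsilon$ only controls exploration:

```python
def q_learning_episode(env, Q, epsilon, alpha=0.1, gamma=0.99, rng=None):
    """Run one episode of off-policy Q-learning.

    Behaviour policy: epsilon-greedy, so all actions keep being sampled.
    Target policy: greedy (the max in the update), which can stay deterministic
    because it is never the policy used to generate the data.
    """
    rng = rng or np.random.default_rng()
    n_actions = env.action_space.n
    state, _ = env.reset()
    done = False
    while not done:
        # behaviour policy: explore
        action = epsilon_greedy(Q, state, n_actions, epsilon, rng)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # target policy: greedy over the next state's action values
        greedy_value = 0.0 if done else float(np.max(Q[next_state]))
        Q[state][action] += alpha * (reward + gamma * greedy_value - Q[state][action])
        state = next_state
    return Q
```

Here you can keep `epsilon` relatively large (or schedule it however suits exploration) and the Q values still estimate the greedy target policy.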

It is worth noting that some environments can be simple enough, or stochastic enough in their state transitions and rewards, that you can use a deterministic policy with an on-policy learner and still explore and learn optimal control. However, you cannot rely on that in general; it would be more common for such an agent to stop learning at some point, far short of optimal.
