Reinforcement Learning – SARSA Updates with S’A’ and Exploitation Policy

Tags: q-learning, reinforcement-learning

Why is it that we update the Q values using S' and A' and not the maximum as in Q-learning?

If the goal is to have a less aggressive exploitation policy, why don't we simply use an epsilon greedy action policy with a small/decaying epsilon, e.g. 0.1? That would make the program quite conservative as well. What's the advantage of updating using S'A' as opposed to using the optimal action?

Best Answer

SARSA is not designed primarily to produce a more conservative policy that includes the effects of exploration whilst learning. However, that is one of the outcomes of its design.

The main difference between SARSA and Q-learning is that SARSA is on-policy: it learns the values of behaving according to the policy it actually follows, including the action choices made for exploration. Q-learning is off-policy, and learns the values of acting optimally more directly. This choice has a number of consequences:

  • SARSA must use the $(S', A')$ pair actually chosen by the behaviour policy when sampling. Q-learning must use its best guess at what the optimal policy would have chosen. You cannot change SARSA to use $\max_{a'}Q(S',a')$, because the two algorithms are otherwise identical; doing so would literally turn SARSA into Q-learning* (the sketch after this list contrasts the two update rules).

  • Q-learning's updates have higher variance than SARSA's, which means it can take longer to converge, and this is part of what causes problems when combining it with neural networks.

  • The conservatism of SARSA comes from the fact that it calculates values which include the effects of exploration (the $\epsilon$-probability of taking non-greedy actions). The values it converges towards are therefore true for the agent whilst it is still learning, and any policy based on them will tend to avoid states where the action choice can lead to low-reward outcomes, sometimes even when those same states can also lead to high-reward outcomes (it depends on the precise values, of course). The classic illustration is the cliff-walking gridworld in Sutton & Barto, where SARSA learns the longer, safer path away from the cliff whilst Q-learning learns the shorter path along its edge.
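To make the first point concrete, here is a minimal tabular sketch in Python of the two update rules side by side. The environment interface (`env.reset()` / `env.step()`), the action set and the step-size constants are hypothetical placeholders rather than anything from the original answer; the point is simply that the only difference between the algorithms is the bootstrap term in the target.

```python
import random
from collections import defaultdict

# Hypothetical tabular setup: Q maps (state, action) -> estimated value.
Q = defaultdict(float)
ACTIONS = [0, 1, 2, 3]
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(state):
    """Behaviour policy used by both algorithms."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next, done):
    # On-policy target: bootstrap from the (S', A') pair the behaviour
    # policy actually chose, whether it was greedy or exploratory.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, done):
    # Off-policy target: bootstrap from the greedy action, i.e. the current
    # best guess at the optimal policy, regardless of what is done next.
    target = r if done else r + gamma * max(Q[(s_next, a)] for a in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def run_sarsa_episode(env):
    """One SARSA episode: A' is selected *before* the update, so the
    exploratory action choices feed directly into the learned values."""
    s = env.reset()
    a = epsilon_greedy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)   # assumed env API: (state, reward, done)
        a_next = epsilon_greedy(s_next)
        sarsa_update(s, a, r, s_next, a_next, done)
        s, a = s_next, a_next
```

A Q-learning loop would instead call `q_learning_update(s, a, r, s_next, done)` inside the loop and only then pick the next action; it does not need $A'$ to perform the update.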

If the goal is to have a less aggressive exploitation policy, why don't we simply use an epsilon greedy action policy with a small/decaying epsilon, e.g. 0.1?

This will work up to a point. However, it will reduce the agent's speed of learning and discovery. You still face the exploration/exploitation trade-off, regardless of whether you use Q-learning or SARSA. You want the agent to learn how to act optimally, but it can only do so by sampling all possibilities over time, and therefore by acting non-optimally, in order to find the optimal actions.

The difference between SARSA and Q-learning in this respect is in how they behave whilst exploring and learning. That gives you a choice of two approaches, each with strengths and weaknesses. Q-learning's big strength is that it learns the optimal policy directly, but it does so at the expense of being more likely to make certain kinds of error along the way. SARSA's strength is that it better optimises the rewards it actually receives whilst learning, but at the expense of not learning the optimal policy directly (so if that is your goal, you will need to decay $\epsilon$ or use some other action-selection mechanism that reduces exploration over time, as sketched below).
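For illustration only, a decaying-$\epsilon$ schedule of the kind mentioned above might look like the following; the function name and constants are arbitrary choices, not something prescribed by SARSA itself.

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Exponentially decaying exploration rate. As epsilon shrinks towards
    eps_min, SARSA's learned values approach those of the greedy policy."""
    return max(eps_min, eps_start * (decay ** episode))
```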


* You can design an algorithm that generalises Q-learning in terms of this quality and can be used for on-policy or off-policy updates: this is called Expected SARSA. It replaces the single sampled $Q(S', A')$ with the expectation of $Q(S', a')$ over the target policy's action probabilities; with a greedy target policy it reduces exactly to Q-learning.
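A minimal sketch of the Expected SARSA update, reusing the hypothetical `Q`, `ACTIONS`, `alpha` and `gamma` from the earlier snippet and assuming an $\epsilon$-greedy target policy: setting `target_epsilon` to 0 recovers the Q-learning (off-policy) update, while matching it to the behaviour policy's $\epsilon$ gives an on-policy update.

```python
def expected_sarsa_update(s, a, r, s_next, done, target_epsilon=0.1):
    # Average Q(S', a') over the epsilon-greedy target policy's action
    # probabilities, instead of sampling A' (SARSA) or maximising (Q-learning).
    greedy = max(ACTIONS, key=lambda act: Q[(s_next, act)])
    base_prob = target_epsilon / len(ACTIONS)
    expected_q = sum(
        (base_prob + (1 - target_epsilon) if act == greedy else base_prob)
        * Q[(s_next, act)]
        for act in ACTIONS
    )
    target = r if done else r + gamma * expected_q
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```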
