Solved – SARSA update rule

machine learning, q-learning, reinforcement learning

The update rule for the SARSA algorithm, as mentioned here, is the following:

$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1})-Q(s_t,a_t)]$

My question is: why can't the value function be used on the RHS of the update, rather than sampling an action? In other words, why can't the update be as follows?

$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_{t+1} + \gamma V(s_{t+1})-Q(s_t,a_t)]$

Will this update rule work? If it does, what advantage does the SARSA update have over this update rule?
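
To make the two rules concrete, here is a minimal tabular sketch (the array layout, `alpha`, and `gamma` are illustrative assumptions, not from the post). Note that if $V(s_{t+1})$ is taken to be $\max_a Q(s_{t+1}, a)$, the second rule is exactly the Q-learning update.

```python
import numpy as np

# Assumed tabular setup for illustration: Q is a |S| x |A| array,
# alpha is the learning rate and gamma the discount factor.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA: bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def value_based_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Proposed rule: bootstrap from V(s_next). If V is defined as
    max_a Q(s_next, a), this is the Q-learning update."""
    v_next = np.max(Q[s_next])
    td_target = r + gamma * v_next
    Q[s, a] += alpha * (td_target - Q[s, a])
```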

Best Answer

I agree with what Sean said. I'll add this little bit to answer the question you asked more concretely:

Will this update rule work? If it does, what advantage does the SARSA update have over this update rule?

The SARSA update rule can converge to different values than the Q-learning rule (which is, like Sean said, essentially what you suggested). This is due to the difference between on-policy and off-policy that he also described. An on-policy algorithm (like the SARSA update rule) converges to the optimal values for the policy that your agent is also using to gather experience. Off-policy algorithms converge to values for a policy that is different from the policy being followed by the agent to gather experience.

The behaviour policy (the one that the agent uses to gather experience) is typically going to be something like $\epsilon$-greedy, where with some nonzero probability you select a random (possibly suboptimal) action. An on-policy algorithm like SARSA takes this into account: it converges to values that are still correct given the knowledge that your agent is sometimes going to be "stupid" and do something random. Q-learning (off-policy learning with the pure greedy policy as "target policy", the policy you're computing values for) is going to converge to values that will only be correct later on, once your agent switches over to a completely greedy policy.
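
A small sketch of that difference in targets, assuming the same illustrative tabular `Q` as above (the function names and `epsilon` value are my own, not from the post): SARSA bootstraps from the next action the behaviour policy actually picks, while Q-learning always bootstraps from the greedy maximum.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """Behaviour policy: random action with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_target(Q, r, s_next, epsilon=0.1, gamma=0.99):
    """On-policy target: uses the action the agent will actually take next."""
    a_next = epsilon_greedy(Q, s_next, epsilon)
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    """Off-policy target: evaluates the greedy target policy, regardless of
    what the behaviour policy does in s_next."""
    return r + gamma * np.max(Q[s_next])
```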

This distinction can be important, for example, in situations where you care about behaving "safely" during the learning process, and not just about the optimal behaviour you will run after learning is finished. Suppose, for example, that you have a robot that starts near a cliff and needs to walk to another point along the same cliff.

  • A Q-learning algorithm is going to converge to values that tell the agent that it's optimal to walk right along the cliff, because that's the shortest path.
  • SARSA is going to converge to different values, which will tell the agent that it's optimal to first walk a safe distance away from the cliff, and then move towards the goal, only getting close to the cliff again when it is also close to the destination. It will converge to these values because it knows that it will sometimes be stupid and take a random action. If it happens to be close to the cliff when it takes a random action, it will fall down and you have to buy a new robot, which can be very expensive. So, it will learn (because it's learning values for the $\epsilon$-greedy policy) that the optimal behaviour is to first walk away from the cliff, so that it won't fall even if it sometimes takes a random action (see the rough numbers in the sketch after this list).
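
A rough one-step illustration of why the cliff-edge square looks so much worse under the $\epsilon$-greedy policy that SARSA evaluates. The rewards and probabilities below are made up for illustration (a per-step cost of -1, a -100 penalty for falling, and 4 possible actions), not taken from the original post.

```python
# Compare the one-step value of standing next to the cliff under a greedy
# policy versus an epsilon-greedy one (illustrative numbers only).
epsilon = 0.1
step_reward = -1.0      # cost of one move
cliff_reward = -100.0   # penalty for falling off the cliff

# Greedy policy (what Q-learning evaluates): it never steps off,
# so the cliff-edge square just costs one step.
greedy_value = step_reward

# Epsilon-greedy policy (what SARSA evaluates): with probability epsilon the
# action is random, and 1 of the 4 random actions walks off the cliff.
p_fall = epsilon * (1.0 / 4.0)
eps_greedy_value = (1.0 - p_fall) * step_reward + p_fall * cliff_reward

print(greedy_value)      # -1.0
print(eps_greedy_value)  # -3.475
```

Under the greedy policy the cliff-edge path is cheap, but under the $\epsilon$-greedy policy its expected cost is much higher, so SARSA learns to prefer the longer, safer route.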