Solved – When to choose SARSA vs. Q Learning

reinforcement learning

SARSA and Q Learning are both reinforcement learning algorithms that work in a similar way. The most striking difference is that SARSA is on-policy while Q Learning is off-policy. The update rules are as follows:

Q Learning:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right]$$

SARSA:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$$

where $s_t,\,a_t$ and $r_t$ are state, action and reward at time step $t$ and $\gamma$ is a discount factor.

They mostly look the same, except that in SARSA we take the actual action, while in Q Learning we take the action with the highest reward.
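For concreteness, here is a minimal sketch of the two one-step updates on a tabular value function, assuming discrete states and actions indexed by integers; the function names and the numpy Q-table are illustrative choices, not part of either algorithm's specification:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy target: bootstrap from the greedy (max) value of the next
    # state, regardless of which action will actually be taken there.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: bootstrap from the value of a_next, the action the
    # behaviour policy has actually chosen to take in s_next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```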

Are there any theoretical or practical settings in which one should prefer one over the other? I can see that taking the maximum in Q Learning can be costly, and even more so in continuous action spaces. But is there anything else?

Best Answer

They mostly look the same, except that in SARSA we take the actual action, while in Q Learning we take the action with the highest reward.

Actually, in both you "take" the same actually generated next action $a_{t+1}$. The difference is in the update: in Q-learning, you update the estimate from the maximum estimate over possible next actions, regardless of which action you actually take next, whilst in SARSA, you update the estimate using the very action you then take.

This is probably what you meant by "take" in the question, but in the literature, taking an action means that it becomes the value of e.g. $a_{t}$, and influences $r_{t+1}$ and $s_{t+1}$.
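To make that distinction concrete, here is a rough sketch of the two control loops, reusing the hypothetical update helpers from the question above. The gymnasium-style `env.reset()`/`env.step()` interface and the `epsilon_greedy` helper (one possible version is sketched further down) are assumptions; terminal states are handled by keeping a zero-initialised Q-table, so bootstrapping from them contributes nothing, and truncation is treated like termination for simplicity.

```python
def run_sarsa_episode(env, Q, alpha, gamma, epsilon):
    s, _ = env.reset()
    a = epsilon_greedy(Q, s, epsilon)                # choose a_t
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = epsilon_greedy(Q, s_next, epsilon)  # a_{t+1} is chosen BEFORE the update...
        sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)
        s, a = s_next, a_next                        # ...and that same a_{t+1} is then taken

def run_q_learning_episode(env, Q, alpha, gamma, epsilon):
    s, _ = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, epsilon)            # choose a_t
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        q_learning_update(Q, s, a, r, s_next, alpha, gamma)  # update from max over next actions
        s = s_next                                   # a_{t+1} is only chosen on the next iteration
```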

Are there any theoretical or practical settings in which one should prefer one over the other?

Q-learning has the following advantages and disadvantages compared to SARSA:

  • Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy whilst exploring. If you want to learn an optimal policy using SARSA, then you will need to decide on a strategy to decay $\epsilon$ in $\epsilon$-greedy action choice, which may become a fiddly hyperparameter to tune (a possible schedule is sketched after this list).

  • Q-learning (and off-policy learning in general) has higher per-sample variance than SARSA, and may suffer from problems converging as a result. This turns up as a problem when training neural networks via Q-learning.

  • SARSA will approach convergence allowing for possible penalties from exploratory moves, whilst Q-learning will ignore them. That makes SARSA more conservative - if there is risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it when the exploration parameters are reduced. The classic toy problem that demonstrates this effect is called cliff walking.
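As an aside on the first bullet, a typical $\epsilon$-greedy behaviour policy with a decayed exploration rate might look like the sketch below; the function names and decay constants are arbitrary placeholders, and finding values that work for a given problem is exactly the fiddly tuning mentioned above.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon):
    # With probability epsilon explore uniformly at random, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    # Exponential decay towards eps_min; with SARSA, this schedule also controls
    # how close the policy being learned gets to the optimal one.
    return max(eps_min, eps_start * decay ** episode)
```

Calling something like `epsilon = decayed_epsilon(i)` once per training episode is one simple way to let SARSA's behaviour policy approach the greedy policy over time.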

In practice the last point can make a big difference if mistakes are costly - e.g. you are training a robot not in simulation, but in the real world. You may prefer a more conservative learning algorithm that avoids high risk when real time and money are at stake should the robot be damaged.

If your goal is to train an optimal agent in simulation, or in a low-cost and fast-iterating environment, then Q-learning is a good choice, due to the first point (learning optimal policy directly). If your agent learns online, and you care about rewards gained whilst learning, then SARSA may be a better choice.
