Solved – Why don’t we use importance sampling for one-step Q-learning?

Tags: q-learning, reinforcement-learning

Why don't we use importance sampling for 1-step Q-learning?

Q-learning is off-policy, which means that we generate samples with a different (behavior) policy than the (target) policy we are trying to optimize. So it seems like it should be impossible to estimate the expected return of every state-action pair under the target policy from samples generated with the behavior policy, at least without a correction such as importance sampling.

Here is the update rule for 1-step Q-learning:

$Q(s_t,a_t) = Q(s_t,a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) \right]$

Here is a link to Sutton's RL book in case you want to look something up.

Best Answer

Here is the one-step Q-learning update rule as you gave it:

\begin{equation}Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]\end{equation}

That update rule actually matches exactly what the "target policy" (greedy policy in this case) is doing; we update the $Q$ value for the state-action pair for which we've just obtained a new observation ($s_t, a_t$, with the new observation being $R_{t+1}$) under the assumption that we follow up with the greedy / target policy immediately afterwards (resulting in $\max_a Q(s_{t+1}, a)$). In this equation, the only action that we may not have taken according to the target policy is the action $a_t$, but that's fine because precisely that same action is the one for which we're updating the $Q$-value.
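To make this concrete, here is a minimal tabular sketch (the function names, the `epsilon_greedy` helper, and all hyperparameters are illustrative, not something from the question): the behaviour policy picks $a_t$, but the bootstrap term always uses the greedy target policy via the max over $Q(s_{t+1}, \cdot)$.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One-step Q-learning: bootstrap with the greedy (target) policy,
    regardless of which behaviour policy selected the action a."""
    td_target = r + gamma * np.max(Q[s_next])   # greedy follow-up from s_{t+1}
    Q[s, a] += alpha * (td_target - Q[s, a])    # only (s_t, a_t) is updated
    return Q

def epsilon_greedy(Q, s, epsilon, rng):
    """Behaviour policy: with probability epsilon take a random action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```

Note that `epsilon_greedy` only decides which $(s_t, a_t)$ pairs get visited; it never appears inside the update itself, which is exactly why no importance-sampling correction is needed here.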


Now suppose that we naively try to write a multi-step update rule (a two-step rule in this example), without importance sampling. That would look as follows:

\begin{equation}Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 \max_a Q(s_{t+2}, a) - Q(s_t, a_t) \right]\end{equation}

This update rule assumes that our behaviour policy (typically something like $\epsilon$-greedy) was used to select another action $a_{t+1}$, resulting in an additional reward observation $R_{t+2}$ and state $s_{t+2}$. In this update rule, we suddenly have a problem for off-policy learning, because our update rule uses a reward $R_{t+2}$ that is the result of an action $a_{t+1}$ which our target policy may not have selected, and it's a different action ($a_t$) for which we're actually updating the $Q$-value. Using information from "incorrect" actions is fine if it's only used to update $Q$-values for those same "incorrect" actions, but it's not fine if we're using it to update $Q$-values for a different ("correct") action (an action that would also have been selected by the greedy / target policy).
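For comparison, here is a sketch of both versions in code. The naive two-step update is the problematic one above; the corrected variant weights the update by an importance-sampling ratio $\rho = \pi(a_{t+1} \mid s_{t+1}) / b(a_{t+1} \mid s_{t+1})$, which for a greedy target policy is zero whenever $a_{t+1}$ is not greedy. This particular weighting scheme, the function names, and the `behaviour_prob_a1` argument are my own illustrative assumptions, roughly in the spirit of the n-step off-policy updates in Sutton's book, not something prescribed above.

```python
import numpy as np

def naive_two_step_update(Q, s, a, r1, r2, s2, alpha=0.1, gamma=0.99):
    """Problematic update: r2 was produced by the behaviour action a_{t+1},
    which the greedy target policy may never have chosen."""
    target = r1 + gamma * r2 + gamma ** 2 * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def is_corrected_two_step_update(Q, s, a, r1, s1, a1, r2, s2,
                                 behaviour_prob_a1, alpha=0.1, gamma=0.99):
    """Two-step update weighted by rho = pi(a1 | s1) / b(a1 | s1).
    Assumption: the target policy is greedy w.r.t. the current Q, so
    pi(a1 | s1) is 1 if a1 is greedy and 0 otherwise; behaviour_prob_a1
    is the probability with which the behaviour policy chose a1."""
    pi_prob_a1 = 1.0 if a1 == int(np.argmax(Q[s1])) else 0.0
    rho = pi_prob_a1 / behaviour_prob_a1
    target = r1 + gamma * r2 + gamma ** 2 * np.max(Q[s2])
    Q[s, a] += alpha * rho * (target - Q[s, a])
    return Q
```

With $\rho = 0$ the update is simply skipped whenever the behaviour policy deviated from the target policy at step $t+1$, which is exactly the information the naive rule was using incorrectly.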