Solved – Off-policy importance sampling for TD(0)

importance-sampling, reinforcement-learning

Consider the off-policy value update
$V(s) \leftarrow V(s) + \alpha\frac{\pi(a\mid s)}{b(a\mid s)}\left[r_t + \gamma V(s') - V(s)\right]$

where $\pi$ is the target policy (from which greedy actions are determined), $b$ is the behavior policy (from which exploratory actions are selected and executed), and $\alpha$ is the learning rate. I see three different options here:

  1. Let $a=a_g$ be the greedy action, and $s'=s'_g$ be the state which results from the greedy action $a_g$ in $s$.
  2. Let $a=a_o$ be the exploratory action, i.e. the action that is actually executed, and $s'=s'_g$ be the state which results from executing the greedy action $a_g$ in $s$.
  3. Let $a=a_g$ be the greedy action, and $s'=s'_o$ be the state which results from the non-greedy action $a_o$ in $s$.

Options 1 and 2 require an environment model (which we must have anyway in order to do control) to determine $s'_g$, which is never actually observed in the environment. So I assume that option 3 is the canonical form of the off-policy value update. Is that correct?

And if you do have a model of the environment from which you can sample $s'_g$, is either option 1 or 2 then preferable to option 3 (assuming my first assumption is correct)?

Best Answer

The point of importance sampling is to adjust the distribution of rewards observed under $b$ towards that expected under $\pi$.

As such, you don't get to choose the $a$ or $s'$ used in the update rule arbitrarily: they must be the observed values. Specifically, $a$ must be generated by the behaviour policy $b(a\mid s)$, and $s'$ must be the state that actually resulted from taking action $a$ in state $s$. If you use any other values, you are no longer sampling from the correct distribution, and any correction based on that assumption would be wrong.
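To make the "use the observed values" point concrete, here is a minimal sketch of importance-sampled TD(0) prediction. The tabular policies `pi` and `b` and the gym-style `reset()`/`step()` interface are illustrative assumptions, not something from the question:

```python
import numpy as np

def off_policy_td0_prediction(env, pi, b, alpha=0.1, gamma=0.99, episodes=1000):
    """Estimate V^pi from experience generated by behaviour policy b.

    Illustrative assumptions: tabular states/actions, pi[s] and b[s] are
    action-probability vectors, and env follows a classic gym-style API.
    """
    V = np.zeros(len(pi))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # a is always drawn from the behaviour policy b -- never from pi.
            a = np.random.choice(len(b[s]), p=b[s])
            # s_next and r are the *observed* outcome of taking a in s.
            s_next, r, done, _ = env.step(a)
            # Per-step importance ratio pi(a|s) / b(a|s).
            rho = pi[s][a] / b[s][a]
            td_error = r + gamma * (0.0 if done else V[s_next]) - V[s]
            V[s] += alpha * rho * td_error
            s = s_next
    return V
```

Note that the greedy action and its hypothetical successor state never appear; the ratio $\rho$ is what accounts for the fact that the data came from $b$ rather than $\pi$.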

If you have an environment model and want to work with expected values, then you can take the expectation under $\pi$ directly, and you do not need importance sampling at all.
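For example, with a known transition model $p(s',r\mid s,a)$ the full expected backup under $\pi$ is

$V(s) \leftarrow V(s) + \alpha\left[\sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\bigl(r+\gamma V(s')\bigr) - V(s)\right]$

and no ratio $\pi/b$ appears anywhere, because nothing is being sampled from $b$.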

I think this is one point of confusion: because you have expressed the question in terms of a state-value function $V$ in basic TD(0), the update you have shown cannot be used for control without a model of the environment; you cannot do greedy or $\epsilon$-greedy action selection from $V$ alone. But that is a separate issue from understanding what importance sampling is doing.

Note that one-step Q-learning (Q(0)) and Expected SARSA(0) do not make use of importance sampling, because in $Q(s,a)$ the action $a$ is already chosen, so the policy used to select it is not relevant, and the TD error is calculated by bootstrapping over $Q(s',\cdot)$ in both algorithms without reference to the behaviour policy.
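Concretely, the standard one-step targets only involve $Q(s',\cdot)$ and the target policy, so the behaviour policy never enters. Q-learning:

$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr]$

Expected SARSA:

$Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma \sum_{a'} \pi(a'\mid s')\,Q(s',a') - Q(s,a)\bigr]$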