Solved – Why does Off-Policy Monte Carlo Control only learn from the “Tails of Episodes”

importance-sampling, monte-carlo, reinforcement-learning

I was reading through section 5.7 of the second edition of Sutton and Barto's "Reinforcement Learning: An Introduction" when I came across this passage:

[Image: the excerpt in question, captioned "A Pretty Perplexing Passage"]

where the "method" the author is referring to is the incremental, every-visit implementation of Off-Policy MC Control with weighted importance sampling that is covered earlier in the section, reproduced here for reference:

[Image: the boxed "Off-Policy MC Control" algorithm (incremental, every-visit, weighted importance sampling)]
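For readers without the book to hand, here is a minimal Python sketch of what that boxed algorithm does. It is my own paraphrase of the pseudocode, not code from the book; the `generate_episode` helper (assumed to follow a soft behaviour policy, e.g. $\epsilon$-greedy with respect to the current Q, and to return $b(A_t \mid S_t)$ alongside each transition) and the `defaultdict` bookkeeping are my own choices:

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(generate_episode, num_actions, num_episodes, gamma=1.0):
    """Incremental, every-visit off-policy MC control with weighted importance sampling.

    States are assumed hashable; actions are integers in [0, num_actions).
    """
    Q = defaultdict(lambda: np.zeros(num_actions))  # action-value estimates
    C = defaultdict(lambda: np.zeros(num_actions))  # cumulative weights

    for _ in range(num_episodes):
        # Episode generated by a soft behaviour policy b:
        # a list of (S_t, A_t, R_{t+1}, b(A_t | S_t)) tuples.
        episode = generate_episode(Q)
        G = 0.0  # return accumulated backwards from the end of the episode
        W = 1.0  # importance-sampling weight from the steps *after* time t

        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            # Target policy is greedy w.r.t. the current Q. If the behaviour
            # policy explored here, pi(A_t | S_t) = 0, so every earlier step
            # would receive weight 0 -- exit the inner loop instead.
            if action != np.argmax(Q[state]):
                break
            W *= 1.0 / b_prob  # ratio pi/b = 1/b(A_t|S_t) for the greedy action

    return Q
```

Note that the ratio for the current action is only multiplied in after $Q(S_t, A_t)$ has been updated, a detail the answer below relies on.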

My question is this: why does the method above only learn from the tails of the episode, as the author suggests in the excerpt?

After searching around a bit, the only relevant link I could find online was here:

http://incompleteideas.net/sutton/book/first/5/node7.html

which is an online version of the first edition of the text. An excerpt from this link clarifies that what the author means by "the tails of episodes" is in fact "after the last nongreedy action." However, I still fail to see why all of the remaining actions after some point in an episode are guaranteed to be greedy, or how these greedy actions belong to the only time steps from which the above method can learn.

Best Answer

I still fail to see why all of the remaining actions after some point in an episode are guaranteed to be greedy,

It is not the case; there is no guarantee that any number of steps will be greedy. The very last step can be non-greedy. In that case, the only return that gets assessed by off-policy Monte Carlo control is the one for $S_{T-1}, A_{T-1}$. That step does get processed (and, in general, so does the last exploratory step in any trajectory), because the importance-sampling ratio is not applied to the action of the $Q(s,a)$ pair that is being updated.

In general, there will be some number of steps at the end of the episode on which the behaviour policy took an action the target policy would also have taken. That number can be anything from 0 up to the length of the entire episode, and its distribution depends on the overlap between the behaviour policy and the target policy. An $\epsilon$-greedy behaviour policy learning a greedy target policy may produce relatively long runs of greedy actions, depending on the value of $\epsilon$: for example, with $\epsilon = 0.1$ and four actions, each step is greedy with probability $1-\epsilon+\epsilon/|\mathcal{A}| = 0.925$, so a run of ten consecutive greedy actions at the end of an episode occurs nearly half the time.

or how these greedy actions belong to the only time steps from which the above method can learn

This is due to weighted importance sampling. The importance-sampling ratio applied on each time step, with target policy $\pi$ and behaviour policy $b$, is:

$$\frac{\pi(a|s)}{b(a|s)}$$

If the target policy is the greedy policy with respect to the current $Q(s,a)$ (which it is in the given Monte Carlo Control algorithm), then any exploratory action inserts a multiplication by 0 into the cumulative importance-sampling ratio (because $\pi(a|s)=0$ for non-greedy actions), making the adjusted return estimate zero regardless of the other factors. In addition, the weighting used when taking the mean is this same factor, so there is no change at all to any $(s,a)$ pair that is followed by a later exploratory action.
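As a quick numerical illustration (not the algorithm itself, just the weight each update would receive), suppose the behaviour policy is $\epsilon$-greedy with $\epsilon = 0.1$ over four actions, and take a toy six-step trajectory where only some steps happened to pick the greedy action:

```python
eps, n_actions = 0.1, 4
p_greedy = 1 - eps + eps / n_actions   # b(greedy action | s) = 0.925
p_explore = eps / n_actions            # b(a specific non-greedy action | s) = 0.025

# Toy trajectory: True = the behaviour policy took the greedy action at step t.
greedy = [True, False, True, True, False, True]

W = 1.0
for t, was_greedy in reversed(list(enumerate(greedy))):
    # The update of Q(S_t, A_t) uses only the ratios from the *later* steps,
    # so report W before multiplying in this step's own ratio.
    print(f"t = {t}: update weight W = {W:.3f}")
    W *= (1.0 / p_greedy) if was_greedy else (0.0 / p_explore)  # pi(a|s)/b(a|s) with greedy pi
```

Only $t=5$ and $t=4$ (the last exploratory step and everything after it) receive a non-zero weight; every earlier $(S_t, A_t)$ gets weight 0 and contributes nothing to either the numerator or the denominator of the weighted average, which is exactly why the algorithm can simply exit its inner loop at the first non-greedy action it encounters.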

If you used ordinary (non-weighted) importance sampling, then in theory you would need to add these zeros into the mean Q estimates, because the denominator of the average return is the number of visits to each $(s,a)$. Adding those zeros counters the over-estimates made when the trajectory from that point on is entirely greedy, in which case the return is scaled up by a factor of $\frac{1}{(1-\epsilon + \frac{\epsilon}{|\mathcal{A}|})^{T-t-1}}$ if, for example, the behaviour policy is $\epsilon$-greedy.
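To put rough numbers on that factor, again assuming $\epsilon = 0.1$ and four actions:

```python
eps, n_actions = 0.1, 4
p_greedy = 1 - eps + eps / n_actions   # 0.925

# Ordinary importance sampling scales the return of a purely greedy tail of
# length k = T - t - 1 by (1 / p_greedy) ** k; the zero-weighted returns are
# what pull the resulting average back down towards an unbiased estimate.
for k in (1, 2, 5, 10):
    print(f"greedy tail of length {k:2d}: return scaled by {(1 / p_greedy) ** k:.2f}")
```

A tail of ten greedy steps inflates that single return by more than a factor of two, and it is the inserted zeros that bring the mean back in line.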

In a sense, then, it is Monte Carlo Control with weighted importance sampling that has this limitation of only learning from the tails of episodes. However, ordinary (un-weighted) importance sampling is not learning much more: the extra updates are just a correction to the over-estimates it makes elsewhere. And in general, weighted importance sampling is preferred because of its lower variance.
