Monte Carlo vs Temporal Difference – When are Monte Carlo Methods Preferred Over Temporal Difference Ones?

monte carlo, reinforcement learning, temporal difference

I've been doing a lot of research about Reinforcement Learning lately. I followed Sutton & Barto's Reinforcement Learning: An Introduction for most of this.

I know what Markov Decision Processes are and how Dynamic Programming (DP), Monte Carlo and Temporal Difference (TD) learning can be used to solve them. The problem I'm having is that I don't see when Monte Carlo would be the better option over TD-learning.

The main difference between them is that TD-learning uses bootstrapping to approximate the action-value function, whereas Monte Carlo uses an average of complete returns to accomplish this. I just can't really think of a scenario in which the Monte Carlo approach would be the better way to go.
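In Sutton & Barto's notation, as far as I understand it, the two updates I am comparing look roughly like this (the first bootstraps from the current estimate of the successor value, the second waits for the complete observed return $G_t$):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big] \qquad \text{(one-step TD, e.g. Sarsa)}$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ G_t - Q(S_t, A_t) \big], \qquad G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \qquad \text{(Monte Carlo)}$$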

My guess is that it might have something to do with performance, but I can't find any sources that prove this.

Am I missing something or is TD-learning generally the better option?

Best Answer

The main problem with TD learning and DP is that their step updates are biased by the initial conditions of the learning parameters. The bootstrapping process typically updates a function or lookup table value Q(s,a) using a successor value Q(s',a'), taking whatever the current estimate of the latter happens to be. Clearly, at the very start of learning these estimates contain no information from any real rewards or state transitions.
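As a rough illustration of how that bias enters, here is a minimal tabular Q-learning sketch on a made-up chain MDP; the environment, hyperparameters and helper names are illustrative assumptions of mine, not anything from a specific source:

```python
import random

N_STATES, ACTIONS = 6, [0, 1]            # toy chain: states 0..5, action 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1    # discount, step size, exploration rate

def env_step(s, a):
    """Move along the chain; reward 1 only when the terminal state is reached."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}   # initial estimates

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for episode in range(500):
    s, done = 0, False
    while not done:
        a = epsilon_greedy(s)
        s2, r, done = env_step(s, a)
        # The bootstrapped target is built from the *current estimate* of the
        # successor value, so early in learning it mostly reflects whatever
        # the Q table was initialised to, not any real experience.
        target = r + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
```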

If learning works as intended, then the bias will reduce asymptotically over multiple iterations. However, the bias can cause significant problems, especially for off-policy methods (e.g. Q-learning) and when using function approximators. That combination of bootstrapping, off-policy learning and function approximation is so likely to fail to converge that it is called the deadly triad in Sutton & Barto.

Monte Carlo control methods do not suffer from this bias, as each update is made using an unbiased sample of the return that Q(s,a) should estimate. However, Monte Carlo methods can suffer from high variance, which means more samples are required to achieve the same degree of learning compared to TD.
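For contrast, here is a minimal every-visit Monte Carlo control sketch on the same kind of toy chain (again, the environment and hyperparameters are just illustrative assumptions); note that every update target is a complete, actually observed return rather than an estimate:

```python
import random

N_STATES, ACTIONS = 6, [0, 1]            # same toy chain as above
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for episode in range(500):
    # Generate a complete episode first; no bootstrapping is involved.
    s, done, trajectory = 0, False, []
    while not done:
        a = epsilon_greedy(s)
        s2, r, done = env_step(s, a)
        trajectory.append((s, a, r))
        s = s2

    # Work backwards through the episode, accumulating the true return G,
    # and move each Q(s, a) towards that unbiased (but high-variance) sample.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G
        Q[(s, a)] += ALPHA * (G - Q[(s, a)])
```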

In practice, TD learning appears to learn more efficiently if the problems with the deadly triad can be overcome. Recent results using experience replay and staged "frozen" copies of estimators provide work-arounds that address these problems - for example, that is how the DQN learner for Atari games was built.
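Very roughly, the mechanics of those two work-arounds look like the sketch below. For readability I use a tabular Q in place of DQN's neural network, and the environment, buffer handling and sync period are illustrative assumptions rather than the actual DQN setup:

```python
import random

N_STATES, ACTIONS = 6, [0, 1]            # same toy chain as above
GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
BATCH_SIZE, SYNC_EVERY = 16, 100         # replay minibatch size, target-sync period

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
Q_target = dict(Q)                       # "frozen" copy used for bootstrap targets
replay = []                              # experience replay buffer (uncapped here; real code would bound it)

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

s, done = 0, False
for step_count in range(1, 10_001):
    if done:
        s, done = 0, False               # start a new episode
    a = epsilon_greedy(s)
    s2, r, done = env_step(s, a)
    replay.append((s, a, r, s2, done))
    s = s2

    if len(replay) >= BATCH_SIZE:
        # Update on randomly sampled, decorrelated past transitions, bootstrapping
        # from the frozen target copy rather than the rapidly changing online Q.
        for s_, a_, r_, s2_, d_ in random.sample(replay, BATCH_SIZE):
            target = r_ + (0.0 if d_ else GAMMA * max(Q_target[(s2_, b)] for b in ACTIONS))
            Q[(s_, a_)] += ALPHA * (target - Q[(s_, a_)])

    if step_count % SYNC_EVERY == 0:
        Q_target = dict(Q)               # refresh the frozen copy periodically
```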

There is also a middle ground between TD and Monte Carlo. It is possible to construct a generalised method that mixes sample trajectories of different lengths - from single-step TD to complete episode runs as in Monte Carlo. The most common variant of this is TD($\lambda$) learning, where $\lambda$ is a parameter from $0$ (effectively single-step TD learning) to $1$ (effectively Monte Carlo learning, but with a nice feature that it can be used in continuous problems). Typically, a value between $0$ and $1$ makes the most efficient learning agent - although, like many hyperparameters, the best value to use depends on the problem.
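One concrete (tabular, on-policy) form of that middle ground is Sarsa($\lambda$) with accumulating eligibility traces; the sketch below reuses the same toy chain and made-up hyperparameters as the earlier snippets:

```python
import random

N_STATES, ACTIONS = 6, [0, 1]            # same toy chain as above
GAMMA, ALPHA, EPSILON, LAMBDA = 0.9, 0.1, 0.1, 0.8

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def epsilon_greedy(s):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for episode in range(500):
    E = {sa: 0.0 for sa in Q}            # eligibility traces, reset each episode
    s, a, done = 0, epsilon_greedy(0), False
    while not done:
        s2, r, done = env_step(s, a)
        a2 = epsilon_greedy(s2)
        delta = r + (0.0 if done else GAMMA * Q[(s2, a2)]) - Q[(s, a)]
        E[(s, a)] += 1.0                 # mark the visited pair as eligible
        for sa in Q:
            Q[sa] += ALPHA * delta * E[sa]   # spread the TD error over recently visited pairs
            E[sa] *= GAMMA * LAMBDA          # decay: lambda=0 acts like TD(0), lambda=1 like MC
        s, a = s2, a2
```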

If you are using a value-based method (as opposed to a policy-based one), then TD learning is generally used more in practice, and a TD/MC combination method such as TD($\lambda$) can be even better.

In terms of a "practical advantage" for MC? Monte Carlo learning is conceptually simple, robust and easy to implement, albeit often slower than TD. I would generally not use it for a learning controller engine (unless in a hurry to implement something for a simple environment), but I would seriously consider it for policy evaluation, for instance in order to compare multiple agents - that is because it provides an unbiased estimate, which is important for testing.
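As a sketch of that evaluation use case, under the same toy-environment assumption as above: run each fixed policy for many episodes and average the discounted returns, which gives an unbiased estimate of its expected return that can be compared directly across agents.

```python
import random

N_STATES, ACTIONS, GAMMA = 6, [0, 1], 0.9   # same toy chain as above

def env_step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def evaluate(policy, n_episodes=1000, max_steps=200):
    """Unbiased Monte Carlo estimate of the mean discounted return of `policy`."""
    returns = []
    for _ in range(n_episodes):
        s, done, g, discount, steps = 0, False, 0.0, 1.0, 0
        while not done and steps < max_steps:
            s, r, done = env_step(s, policy(s))
            g += discount * r
            discount *= GAMMA
            steps += 1
        returns.append(g)
    return sum(returns) / len(returns)

# Compare two example policies: always move right vs. act uniformly at random.
print("always right:", evaluate(lambda s: 1))
print("random      :", evaluate(lambda s: random.choice(ACTIONS)))
```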
