Solved – Why do temporal difference (TD) methods have lower variance than Monte Carlo methods

machine learning, monte carlo, reinforcement learning, temporal difference

In many reinforcement learning papers, it is stated that one of the advantages of temporal difference (TD) methods over Monte Carlo methods for estimating the value function is that they have lower variance. I have not been able to find any formal proof of this.

Moreover, it is also said that Monte Carlo methods are less biased than TD methods. If somebody could help me better understand this, I would appreciate it.

Best Answer

The difference between the algorithms is how they construct the target value from experience.

Using action values to make it a little more concrete, and sticking with on-policy evaluation (not control) to keep the argument simple, the update to estimate $Q(S_t,A_t)$ takes the same general form for both TD and Monte Carlo:

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha(X - Q(S_t,A_t))$$

where $X$ is an estimate of the true value of $Q(S_t,A_t)$, gained through some experience. When discussing whether an RL value-based technique is biased or has high variance, the part that has these traits is whatever stands for $X$ in this update.
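To make this shared form concrete, here is a minimal tabular sketch in Python (the dictionary `Q`, the `update` function and the default `alpha` are illustrative assumptions, not anything prescribed in the answer):

```python
# A minimal sketch of the shared update rule, assuming a tabular action-value
# table Q stored as a dict keyed by (state, action) pairs.
def update(Q, state, action, x, alpha=0.1):
    """Move Q(state, action) a fraction alpha towards the target X."""
    Q[(state, action)] += alpha * (x - Q[(state, action)])

Q = {("s0", "a0"): 0.0}
update(Q, "s0", "a0", x=1.0)   # Q[("s0", "a0")] is now 0.1
```

TD and Monte Carlo then differ only in how the argument `x` is computed.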

For Monte Carlo techniques, the value of $X$ is estimated by following a sample trajectory starting from $(S_t,A_t)$ and adding up the rewards to the end of an episode (at time $\tau$):

$$\sum_{k=0}^{\tau-t-1} \gamma^kR_{t+k+1}$$
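As a hypothetical sketch, the Monte Carlo target could be computed like this from the list of sampled rewards $R_{t+1}, \dots, R_{\tau}$ of one episode (the function name and discount value are assumptions for illustration):

```python
# Monte Carlo target: discount and sum the sampled rewards from (S_t, A_t)
# to the end of the episode.
def mc_target(rewards, gamma=0.99):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(mc_target([1.0, 0.0, 2.0], gamma=0.9))   # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```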

For temporal difference, the value of $X$ is estimated by taking one step of sampled reward, and bootstrapping the estimate using the Bellman equation for $q_{\pi}(s,a)$:

$$R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$$
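A corresponding sketch for the one-step TD (SARSA-style) target, again with illustrative names, bootstraps from the current table:

```python
# One-step TD target: one sampled reward plus the discounted bootstrap
# estimate Q(S_{t+1}, A_{t+1}) taken from the current value table.
def td_target(reward, Q, next_state, next_action, gamma=0.99):
    return reward + gamma * Q[(next_state, next_action)]
```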

Bias

In both cases, the true value function that we want to estimate is the expected return after taking the action $A_t$ in state $S_t$ and following policy $\pi$. This can be written:

$$q_{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\tau-t-1} \gamma^k R_{t+k+1} \,\middle|\, S_t=s, A_t=a\right]$$

That should look familiar. The Monte Carlo target for $X$ is clearly a direct sample of this expectation, so it has the same expected value as the quantity we are looking for. Hence it is unbiased.

TD learning, however, has a problem caused by its initial value estimates. The bootstrap value of $Q(S_{t+1}, A_{t+1})$ is initially whatever you set it to, arbitrarily, at the start of learning. That value has no bearing on the true value you are looking for, hence the target is biased. Over time, the bias decays exponentially as real values from experience are used in the update process. At least that is true for basic tabular forms of TD learning. When you add a neural network or another function approximator, this bias can cause stability problems, causing an RL agent to fail to learn.
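A toy numerical illustration of that initialisation bias, with entirely made-up numbers: suppose every true action value is $0$, but the table is initialised to $10$.

```python
from collections import defaultdict

# All true action values are 0, but Q is arbitrarily initialised to 10.
Q = defaultdict(lambda: 10.0)
gamma, alpha = 0.9, 0.5

# One TD update with a sampled reward of 0: the bootstrap term drags the
# target towards the arbitrary initial value.
target = 0.0 + gamma * Q[("s1", "a")]           # 9.0, although the true value is 0
Q[("s0", "a")] += alpha * (target - Q[("s0", "a")])
print(Q[("s0", "a")])                           # 9.5: biased, but shrinking with each update
```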

Variance

Looking at how each update mechanism works, you can see that the TD target is exposed to three factors, each of which can vary (in principle, depending on the environment) over a single time step:

  • What reward $R_{t+1}$ will be returned

  • What the next state $S_{t+1}$ will be

  • What action the policy will choose for $A_{t+1}$

Each of these factors contributes to the variance of the TD target value. However, the bootstrapping mechanism means there is no direct variance contribution from any later events on the trajectory; those three factors are the only sources.

In comparison, the Monte Carlo return depends on every reward, state transition and policy decision from $(S_t, A_t)$ up to $S_{\tau}$. As this usually covers many time steps, the variance will also typically be a multiple of that seen in TD learning on the same problem. This is true even for deterministic environments with sparse, deterministic rewards, because an exploring policy must be stochastic, and that injects some variance at every time step involved in the sampled value estimate.
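A rough empirical check of this argument, under assumed toy conditions: a 5-step task where every step yields a reward of $\pm 1$ with equal probability, and where the TD target is allowed to bootstrap from the true next value ($0$ here), so its only noise is the single sampled reward.

```python
import random

GAMMA = 1.0
STEPS = 5

def mc_sample():
    """Monte Carlo target: sum of the sampled rewards over all remaining steps."""
    return sum(random.choice([-1.0, 1.0]) for _ in range(STEPS))

def td_sample():
    """One-step TD target: one sampled reward plus the true next value (0 here)."""
    return random.choice([-1.0, 1.0]) + GAMMA * 0.0

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

mc = [mc_sample() for _ in range(10000)]
td = [td_sample() for _ in range(10000)]
print(variance(mc), variance(td))   # roughly 5 vs 1: one noisy term per remaining step
```

The Monte Carlo target accumulates one noisy term per remaining step, while the TD target carries only the single-step noise, which is exactly the comparison made above.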