Solved – Why experience replay requires an off-policy algorithm

reinforcement learning

In the paper introducing DQN, "Playing Atari with Deep Reinforcement Learning", the authors mention:

Note that when learning by experience replay, it is necessary to learn off-policy
(because our current parameters are different to those used to generate the sample), which motivates
the choice of Q-learning.

I don't quite understand what this means. What if we use SARSA and also store, in our replay memory, the action a' that we took in s', and then sample batches from it and update Q as we do in DQN? And can actor-critic methods (A3C, specifically) use experience replay? If not, why not?
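For concreteness, something like the following is what I have in mind (a minimal sketch with made-up names; a toy tabular Q stands in for the network a real DQN would use):

```python
import random
from collections import deque

import numpy as np

# Toy tabular Q-function just for illustration; in DQN this would be a neural network.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

replay = deque(maxlen=10000)

def store(s, a, r, s_next, a_next, done):
    # Unlike DQN, the stored tuple also keeps a', the action actually taken in s'.
    replay.append((s, a, r, s_next, a_next, done))

def sarsa_replay_update(batch_size=32):
    batch = random.sample(list(replay), min(batch_size, len(replay)))
    for s, a, r, s_next, a_next, done in batch:
        # SARSA target uses the stored a' -- but that a' was picked by an
        # older version of the policy, not by the policy the current Q implies.
        target = r if done else r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
```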

Best Answer

On-policy methods, like SARSA, expect the action in every state to be chosen according to the agent's current policy, which usually tends to exploit rewards.

That way, the policy improves as we update it based on the most recent rewards (here, in particular, by updating the parameters of the NN that predicts the value of a given state/action).

But if we update our policy based on stored transitions, as in experience replay, we are actually evaluating actions chosen by a policy that is no longer the current one, since it has evolved over time, so the update is no longer on-policy.

In SARSA, the Q values estimate the future rewards you will get from a state by continuing to follow the agent's current policy.

With replayed samples, however, that assumption no longer holds, because the stored action was produced by a different (older) policy. So the DQN authors instead use Q-learning, a standard off-policy method, and handle exploration with an epsilon-greedy behaviour policy.
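To make the contrast concrete, here is a minimal sketch (function and variable names are my own) of the kind of target Q-learning computes from a replayed transition. It only needs (s, a, r, s') and re-evaluates s' under the current Q, so it does not matter which old policy generated the sample:

```python
import numpy as np

gamma = 0.99  # discount factor

def q_learning_targets(Q, batch):
    """Off-policy targets for replayed transitions (s, a, r, s', done).

    The max over actions is taken under the current Q, so the stored
    transition can come from any (possibly outdated) behaviour policy.
    """
    targets = []
    for s, a, r, s_next, done in batch:
        targets.append(r if done else r + gamma * np.max(Q[s_next]))
    return targets
```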
