Can someone please explain the target update for deep Q-learning?

deep learning, reinforcement learning

The algorithm for deep Q-learning is taken from here: https://ai.intel.com/demystifying-deep-reinforcement-learning/

[Pseudocode for deep Q-learning with experience replay, reproduced from the linked article]

I am confused about how the target is calculated, and why it is so.

To train the network, we first perform an action and get a new state $s^\prime$.

We store the transition $<s, a, r, s^\prime>$ in the replay memory $D$.

Now we sample a number of random transitions from the replay memory.

We then calculate (more precisely, update) the target for each minibatch transition. I understand that we need to update our target so it becomes closer to the optimal expected discounted reward.

But I don't understand why we are doing it with random batches.

My questions:

  1. Why do we need to train on random transitions instead of the transition we were just in? Logically, we just performed an action in a given state and received a reward. Don't we want to update our network based on what we just did, i.e., tell the network how good that action was?

  2. Why is it that if $ss^\prime$ is a terminal state, then we set the target (again, the optimal predicted Q-value) to be just the reward?

  3. Why do we update the target $tt$ as $rr + \gamma \max_{a^\prime} Q(ss^\prime, aa^\prime)$? How does this improve the optimal Q-value? It seems that we are replacing our older prediction with a newer prediction, but the information we have just gained (i.e., the $<s,a,r,s^\prime>$ we just played) does not seem to cause any improvement in the prediction $tt$, since we are changing the states $ss^\prime$ (uniformly sampled histories) that may have nothing to do with the state $s$ we were just in. What is the point of updating random histories that have nothing to do with the new experience?

To put my concern in an even more simplified form:

new information gained: $<s, a, r, s^\prime>$

update prediction for: a random $<ss, aa, rr, ss^\prime>$

How does the new information help with the prediction update?

From another perspective: in the supervised case, we gain improvement because the targets are labelled (i.e., we know a priori that the image is a dog), so we are matching inputs to targets and training our network to do this better and better. But we don't have the optimal target in this case… so I can't see where the improvement comes from.

  4. Why do we limit the size of the batches we train on? Suppose I allocate a replay memory of 1000 and store 1000 histories. In the DQN paper, they only use a batch of 32 per iteration. Why not all 1000? It seems that if we want to improve our target, more would be better.

Best Answer

  1. Why do we need to train on random transitions instead of the transition we were just in? Logically, we just performed an action in a given state and received a reward. Don't we want to update our network based on what we just did, i.e., tell the network how good that action was?

This is because we're training a Neural Network. In "regular" Reinforcement Learning (tabular for example), your suggestion is indeed much more common. The main problem with the idea of always updating on the latest sample of experience is that there will be an extremely large amount of correlation between our most recent samples of experience. If every state transition is only a small step in a large environment, we'll have lots of consecutive learning steps all for very similar states; they're highly correlated with respect to time (consecutive state transitions), and also likely to be highly correlated with respect to the states themselves if every individual action only creates a relatively small change in the state.

The problem with a bunch of consecutive, highly correlated updates for similar states is that training a Neural Network with that kind of input is problematic; a Neural Network is prone to "forgetting" things it has learned previously for certain inputs if you don't repeatedly keep showing it those kinds of inputs. So, if we take 100 small steps that are all kind of similar in terms of time and state representation, and consecutively train our network on all of those without also showing a bunch of more diverse, older inputs in between, it can easily forget everything it has learned about those older inputs and focus only on performing well for these few recent inputs.

By randomly sampling the inputs we present to our network from a large experience replay buffer, which also contains significantly older experiences, we can ensure that we have a diverse set of inputs to learn from, and avoid forgetting about things we learned before.
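
To make this concrete, here is a minimal sketch of such a replay buffer (the class and method names are my own illustrative choices, not from the DQN paper):

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of <s, a, r, s', done> transitions."""

    def __init__(self, capacity):
        # Oldest transitions are evicted automatically once we exceed capacity.
        self.transitions = deque(maxlen=capacity)

    def __len__(self):
        return len(self.transitions)

    def add(self, state, action, reward, next_state, done):
        # Every new transition is stored, even if we don't train on it right away.
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between
        # consecutive transitions before they ever reach the network.
        return random.sample(list(self.transitions), batch_size)
```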

  2. Why is it that if $ss^\prime$ is a terminal state, then we set the target (again, the optimal predicted Q-value) to be just the reward?

Hmm, it looks like there's a typo in the pseudocode; the $\max$ should be taken over all $aa^\prime$, not $a^\prime$. Anyway, this is not specific to Deep RL; it would happen in regular RL too. The term we choose to ignore here ($\gamma \max_{aa^\prime} Q(ss^\prime, aa^\prime)$) can intuitively be interpreted as the best discounted return we expect to receive in the future, after reaching state $ss^\prime$. What future return do you expect to receive after reaching a terminal state? Well, the episode ends right there; it's a terminal state, so we do not expect to receive anything more: we expect to receive $0$. That's why we cut off that part of the equation; the only reasonable thing we can expect there is $0$.
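
In code, this special case is just a branch on the terminal flag. A minimal sketch, assuming a hypothetical `q_network(state)` that returns a vector of Q-values, one per action:

```python
import numpy as np


def td_target(reward, next_state, done, q_network, gamma=0.99):
    """Target for a single sampled transition <ss, aa, rr, ss', done>."""
    if done:
        # Terminal state: no future returns can follow ss', so the expected
        # discounted future return is 0 and the target is just the reward.
        return reward
    # Non-terminal: bootstrap with the best Q-value the network predicts in ss'.
    return reward + gamma * np.max(q_network(next_state))
```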

  3. Why do we update the target $tt$ as $rr+\gamma \max_{aa^\prime}Q(ss^\prime,aa^\prime)$? How does this improve the optimal Q-value? It seems that we are replacing our older prediction with a newer prediction, but the information we have just gained (i.e., the $<s,a,r,s^\prime>$ we just played) does not seem to cause any improvement in the prediction $tt$, since we are changing the states $ss^\prime$ (uniformly sampled histories) that may have nothing to do with the state $s$ we were just in. What is the point of updating random histories that have nothing to do with the new experience?

The same answer as for your first question holds here. Note that the new experience is also stored in the replay buffer, so we do expect to eventually learn something from it too. We just might not learn from it right now; we might instead learn from some other, older experiences (from which we may also not have learned anything yet). It's all just about diversifying our inputs again, making sure we see a little bit of everything.
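
Putting the two sketches above together, one interaction step looks roughly like this (again with hypothetical names; `env.step`, `q_network`, and `train_on_batch` stand in for whatever environment, model, and optimizer step you actually use):

```python
def dqn_interaction_step(env, state, action, buffer, q_network,
                         train_on_batch, batch_size=32, gamma=0.99):
    # 1. Act in the environment and observe the new experience <s, a, r, s'>.
    next_state, reward, done = env.step(action)

    # 2. Store it in the replay buffer; we will learn from it eventually,
    #    just not necessarily on this exact step.
    buffer.add(state, action, reward, next_state, done)

    # 3. Train on a *random* minibatch, which mostly contains older,
    #    unrelated transitions -- this is the decorrelation discussed above.
    if len(buffer) >= batch_size:
        batch = buffer.sample(batch_size)
        targets = [td_target(r, s2, d, q_network, gamma)
                   for (_, _, r, s2, d) in batch]
        train_on_batch(batch, targets)

    return next_state, done
```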

  4. Why do we limit the size of the batches we train on? Suppose I allocate a replay memory of 1000 and store 1000 histories. In the DQN paper, they only use a batch of 32 per iteration. Why not all 1000? It seems that if we want to improve our target, more would be better.

One reason can simply be the amount of RAM (or VRAM in a GPU) we have available; if you increase the batch size, you increase the amount of memory you require for your computations, so eventually you're going to hit a limit. I don't think that's the only reason though.

Using a smaller batch size essentially increases the variance of your gradient estimates. This can slow down learning a bit, because your gradient estimates (and therefore the directions in which you're performing updates) can bounce all over the place (high variance). Sometimes you'll perform an update in one direction, and then perform another update in the opposite direction afterwards. This obviously slows you down a bit. However, it also adds a bit of additional "exploration": you get to investigate the "landscape" of your objective function a bit more, and you're less likely to tunnel-vision on a local minimum of your loss function and overlook a better minimum somewhere else.

Larger batch sizes result in more stable gradient estimates with less variance; you'll more quickly find the direction you want to go in and keep going there.

So, there's really nothing definitive to say about whether a smaller or a larger batch size is better; it's a hyperparameter that must be tuned, with advantages and disadvantages to either choice. Intuitively, a smaller batch size (more variance) seems helpful to me especially early on in training: the gradient estimates of a larger batch size may be more stable and more "precise", but early on they are based on highly flawed data anyway, generated by an agent that is still close to random. Do you really want to immediately tunnel-vision on a local optimum based on such flawed data? Probably not; bouncing around a bit with higher variance seems helpful. There certainly has been research using larger batch sizes though, with hardware that supports it. See, for example, this paper.
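
As a toy illustration of the variance point (plain numpy, not a DQN experiment): repeatedly estimating a mean "gradient" from random minibatches shows how much noisier the small-batch estimate is:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are per-sample gradient contributions with true mean 1.0.
per_sample_grads = 1.0 + rng.normal(scale=2.0, size=100_000)

for batch_size in (32, 1000):
    # Estimate the "gradient" many times from random minibatches of this size.
    estimates = [rng.choice(per_sample_grads, size=batch_size).mean()
                 for _ in range(1000)]
    print(f"batch size {batch_size:4d}: std of estimate = {np.std(estimates):.3f}")
```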
