On-policy methods, like SARSA, expect the action in every state to be chosen according to the agent's current policy, which usually tends to exploit rewards. By doing so, the policy improves when we update it based on the most recent rewards (here in particular, by updating the parameters of the NN that predicts the value of a given state/action pair).
But if we update our policy based on stored transitions, as in experience replay, we are actually evaluating actions from a policy that is no longer the current one, since the policy has evolved over time; this makes the method no longer on-policy.
The Q-values are estimated from the future rewards you will get from a state when following the current agent policy. However, that is no longer true when replaying stored transitions, since they were generated by a different policy. So instead they use a common off-policy method, and explore with an epsilon-greedy approach.
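To make the epsilon-greedy idea concrete, here is a minimal sketch (function name and signature are my own, not from any particular library): with probability epsilon we pick a random action, otherwise the action with the highest estimated Q-value.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values: a list of estimated Q-values, one entry per action.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the highest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` this is purely greedy; with `epsilon=1` it is purely random, so annealing epsilon over training trades exploration for exploitation.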
- Why do we need to train on random transitions instead of the transition we were just in? Logically, we just performed an action in a given state and received a reward; don't we want to update our network based on what we just did, i.e., tell the network how good that action was?
This is because we're training a Neural Network. In "regular" Reinforcement Learning (tabular for example), your suggestion is indeed much more common. The main problem with the idea of always updating on the latest sample of experience is that there will be an extremely large amount of correlation between our most recent samples of experience. If every state transition is only a small step in a large environment, we'll have lots of consecutive learning steps all for very similar states; they're highly correlated with respect to time (consecutive state transitions), and also likely to be highly correlated with respect to the states themselves if every individual action only creates a relatively small change in the state.
The problem with a bunch of consecutive, highly correlated updates for similar states is that training a Neural Network with that kind of input is problematic; a Neural Network is prone to "forgetting" things it has learned previously for certain inputs if you don't repeatedly keep showing it those kinds of inputs. So, if we take 100 small steps that are all kind of similar in terms of time and state representation, and consecutively train our network on all of those without also showing a bunch of more diverse, older inputs in between, it can easily forget everything it has learned about those older inputs and focus only on performing well for these few recent inputs.
By randomly sampling the inputs we present to our network from a large experience replay buffer, which also contains significantly older experiences, we can ensure that we have a diverse set of inputs to learn from, and avoid forgetting about things we learned before.
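A replay buffer of this kind can be sketched in a few lines (this is a generic illustration, not the DQN paper's exact implementation): a fixed-capacity store that discards the oldest transitions and samples uniformly at random.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; the oldest transitions are discarded as new ones arrive."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        # Every new experience goes into the buffer, so we will
        # eventually learn from it even if not immediately.
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions described above.
        return random.sample(self.buffer, batch_size)
```

Each training step then calls `sample(batch_size)` and performs one gradient update on that minibatch, mixing recent and old experiences.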
- Why is it that if $s'$ is a terminal state, then we set the target (again, the optimal predicted Q-value) to be the reward?
Hmm, it looks like there's a typo in the pseudocode: the $\max$ should be taken over all $a'$, not $a$. Anyway, this is not specific to Deep RL; it would happen in regular RL too. The term we choose to ignore here ($\gamma \max_{a'} Q(s', a')$) can intuitively be interpreted as the best discounted return we expect to receive in the future, after reaching state $s'$. What future returns do you expect to receive after reaching a terminal state? Well, the episode ends right there; it's a terminal state, so we do not expect to receive anything more: we expect to receive $0$. That's why we cut off that part of the equation; the only reasonable thing we can expect there is $0$.
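In code, this target computation is just a conditional (a sketch with a hypothetical helper name; `s_next_q_values` stands for the network's Q-value estimates for all actions in $s'$):

```python
def q_target(r, s_next_q_values, done, gamma=0.99):
    """Bootstrapped target for a single transition (s, a, r, s').

    For a terminal s' the expected future return is 0,
    so the target collapses to just the reward r.
    """
    if done:
        return r
    return r + gamma * max(s_next_q_values)
```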
- Why do we update the target $t$ as $r + \gamma \max_{a'} Q(s', a')$? How does this improve the optimal Q-value? It seems that we are replacing our older prediction with a newer prediction, but the information we have just gained (i.e., the $\langle s, a, r, s' \rangle$ we just played) does not seem to cause any improvement in the prediction $t$, since the states $s'$ come from uniformly sampled histories that may have nothing to do with the state $s$ we were just in. What is the point of updating random histories that have nothing to do with the new experience?
The same answer as for your first question holds here. Note that the new experience is also stored in the replay buffer, so we do expect to eventually learn something from it too. We just might not learn from it right now; we might instead learn from some older experiences (from which we maybe also have not learned anything yet). It's all just about diversifying our inputs again, making sure we see a little bit of everything.
- Why do we limit the number of batches to train on? Suppose I allocate a replay memory of 1000 and store 1000 histories. In the DQN paper, they only use a batch of 32 every iteration. Why not all 1000? It seems that if we want to improve our target, more is better.
One reason can simply be the amount of RAM (or VRAM in a GPU) we have available; if you increase the batch size, you increase the amount of memory you require for your computations, so eventually you're going to hit a limit. I don't think that's the only reason though.
Using a smaller batch size essentially increases the variance of your gradient estimates. This can slow down learning a bit, because your gradient estimates (and therefore directions in which you're performing updates) can bounce all over the place a bit (high variance). Sometimes you'll perform an update in one direction, and perform another update in the opposite direction afterwards. This obviously slows you down a bit. However, it also adds a bit of additional "exploration", you get to investigate the "landscape" of your objective function a bit more, you're less likely to tunnel-vision on a local minimum of your loss function and overlook a better minimum somewhere else.
Larger batch sizes result in more stable gradient estimates, less variance, you'll more quickly find the direction you want to go in and keep going there.
So, there's really nothing definitive to say about which one is better, smaller or larger batch size, it's really just a hyperparameter that must be tuned. There are advantages and disadvantages to either. Intuitively, especially early on in training a smaller batch size (more variance) seems helpful to me: the gradient estimates of a larger batch size may be more stable, more "precise", but early on in training it is based on highly flawed data anyway; data generated by close to a random agent early on in training. Do you really immediately want to tunnel-vision on a local optimum based on such flawed data? Probably not, bouncing around a bit with higher variance seems helpful. There certainly has been research using larger batch sizes though, with hardware that supports it. See, for example, this paper.
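The variance argument can be illustrated with a tiny standalone simulation (my own toy example, not tied to any RL library): treat each per-sample gradient as a noisy estimate of the true gradient, and compare how much minibatch averages of different sizes fluctuate.

```python
import random
import statistics

def noisy_gradient():
    # Stand-in for a single sample's gradient: true value 1.0 plus Gaussian noise.
    return 1.0 + random.gauss(0.0, 1.0)

def batch_gradient(batch_size):
    # A minibatch gradient estimate is the mean of per-sample gradients,
    # so its variance shrinks roughly as 1 / batch_size.
    return statistics.mean(noisy_gradient() for _ in range(batch_size))
```

Collecting, say, 500 estimates each with batch sizes 4 and 64 and comparing their empirical variances shows the batch-of-64 estimates clustering much more tightly around the true gradient, i.e., the "stable but possibly tunnel-visioned" regime described above.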
Best Answer
What you are describing at the end is online learning, where we are continually updating the approximation of $Q$.
There is also the possibility of waiting until the end of the episode before sampling from $D(s,a,r,s')$. This can allow you to ground your reward $r$ at each step, in case the reward is delayed.
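One simple way to "ground" a delayed reward once the episode has finished is to replace each stored step reward with its discounted return-to-go, computed backward through the trajectory (a sketch of one possible approach; the function name is mine):

```python
def grounded_rewards(rewards, gamma=0.99):
    """Compute discounted returns-to-go after an episode ends.

    Each step's stored value becomes the discounted sum of all future
    rewards, so a delayed terminal reward is propagated back to the
    earlier steps that led to it.
    """
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # accumulate from the end of the episode backward
        returns.append(g)
    return list(reversed(returns))
```

For instance, with a single reward at the end of a three-step episode, the earlier steps receive discounted credit for that final reward instead of the raw zeros they observed online.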