Solved – Minibatching in Stochastic Gradient Descent and in Q-Learning

deep learning, q-learning, reinforcement learning

Background (may be skipped):

In training neural networks, stochastic gradient descent (SGD) is usually used: instead of computing the network's error on all members of the training set before updating the weights by gradient descent (which means waiting a long time between weight updates), each update uses a minibatch of members, and the resulting error is treated as an unbiased estimate of the true error.
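For concreteness, here is a minimal sketch of one epoch of minibatch SGD in PyTorch; the network, data, and hyperparameters are hypothetical placeholders, not taken from the question:

```python
import torch

# Hypothetical tiny regression network and synthetic data, purely for illustration.
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
loss_fn = torch.nn.MSELoss()

X = torch.randn(1000, 10)  # full training set: inputs
y = torch.randn(1000, 1)   # full training set: targets

batch_size = 32
perm = torch.randperm(len(X))              # shuffle once per epoch
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]   # one minibatch of members
    loss = loss_fn(net(X[idx]), y[idx])    # error estimated on the minibatch only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                       # one weight update per minibatch
```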

In reinforcement learning, Q-learning is sometimes implemented with a neural network (as in deep Q-learning), and experience replay is used: instead of updating the weights with the agent's most recent transition, update with a minibatch of randomly sampled past transitions (state, action, reward, next state), so that successive updates are not correlated.
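A minimal sketch of such a replay buffer, using a hypothetical `ReplayBuffer` class (names are illustrative, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns uncorrelated random minibatches."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted when full

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)  # uniform, without replacement
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```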

The Question:

Is the following assertion correct? When minibatching in SGD, one weight update is performed for the whole minibatch, while when minibatching in Q-learning, one weight update is performed for each member of the minibatch.

Best Answer

The answer is no. The Q-network's parameters can be updated all at once using every example in the minibatch. Denote the members of the minibatch by $(s_1,a_1,r_1,s'_1),(s_2,a_2,r_2,s'_2),\dots,(s_M,a_M,r_M,s'_M)$. Then the loss is estimated relative to the current Q-network's parameters:

$$\hat{L}(\theta)=\frac{1}{M}\sum_{i=1}^M(Q(s_i,a_i;\theta)-(r_i+\gamma\max_{a'}{Q(s'_i,a';\theta)}))^2$$

This is an estimate of the true loss, which is an expectation over all transitions $(s,a,r,s')$. In this way, the Q-network's parameters are updated just as in SGD: one update per minibatch.
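A minimal sketch of such an update in PyTorch, assuming a discrete action space and a hypothetical Q-network `q_net` (all names are illustrative):

```python
import torch

def dqn_minibatch_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on the whole minibatch, mirroring the estimated loss L_hat(theta) above."""
    states, actions, rewards, next_states = batch  # tensors of length M

    # Q(s_i, a_i; theta) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets r_i + gamma * max_a' Q(s'_i, a'; theta); the same parameters theta
    # are used for the target, exactly as written in the formula above.
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values

    loss = ((q_values - targets) ** 2).mean()  # (1/M) * sum of squared TD errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # one weight update for the whole minibatch
    return loss.item()
```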

Notes:

  • The estimate is biased, since it does not contain a term representing the variance due to $s'$, but this does not change the direction of the gradient.
  • Sometimes the second set of parameters $\theta$ inside the squared term is not the current one but a frozen copy from an earlier iteration (a target network, as used in DQN and in double Q-learning); see the sketch after these notes.
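To make that variant concrete, here is how the update sketch above would change; `q_net`, `rewards`, `next_states`, and `gamma` are the same hypothetical names used there:

```python
import copy

import torch

# theta^-: a frozen copy of the current parameters theta.
target_net = copy.deepcopy(q_net)

# The bootstrap target is evaluated with the frozen copy rather than the current theta.
with torch.no_grad():
    targets = rewards + gamma * target_net(next_states).max(dim=1).values

# Every C updates (a hyperparameter), refresh the frozen copy.
target_net.load_state_dict(q_net.state_dict())
```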