Solved – How to update network weights in DQN

Tags: deep-rl, q-learning, reinforcement-learning

"The output layer is a fully-connected linear layer with a
single output for each valid action." (from Mnih et al. 2015)
Let's say we have 4 actions, so the DQN outputs 4 values $(y_1, y_2, y_3, y_4)$, one for each action.

But for a single training sample $x_t = (s_t, a_t, r_t, s_{t+1})$ we have only one target value $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$.
If $a_t = a_1$, then $y_t$ corresponds to output $y_1$ of the network. How do we update the network weights $\theta$ when the targets for $y_2, y_3, y_4$ are not known?

Best Answer

But for a single training sample $x_t = (s_t, a_t, r_t, s_{t+1})$ we have only one target value $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$. If $a_t = a_1$, then $y_t$ corresponds to output $y_1$ of the network. How do we update the network weights $\theta$ when the targets for $y_2, y_3, y_4$ are not known?

There are two approaches in practice:

  1. Use a network with a single output that takes features of both the state and the action as input and predicts a single Q value. To find $\max_{a'} Q(s, a'; \theta)$ for a given state $s$, you run the predictor on a mini-batch containing that one state paired with every possible action. This avoids the need to feed back unknown targets for actions that were not taken (see the first sketch after this list).

  2. Use a network with one output per possible action. To train it, construct a target vector that changes only the value for the action actually taken, setting all the others as if the prediction were already correct (TD error of zero). This is easy to do because you already have the full vector of predictions from the last time you looked for the max. Take that vector, replace the value for the action just taken with the reward plus the discounted max over the successor state, and use the result as your supervised training target for the neural network (see the second sketch below).
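
As a rough illustration of the first approach, here is a minimal sketch in PyTorch. The framework choice, the layer sizes, and the `QStateAction` / `max_q` names are illustrative assumptions, not from the paper: the network scores one (state, action) pair at a time, and the max over actions is obtained by batching a single state with every one-hot encoded action.

```python
# Hypothetical sketch of approach 1: Q(s, a) from (state, action) input.
import torch
import torch.nn as nn

N_ACTIONS = 4
STATE_DIM = 8  # assumed state feature size, for illustration only

class QStateAction(nn.Module):
    def __init__(self):
        super().__init__()
        # input = state features concatenated with a one-hot action encoding
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + N_ACTIONS, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # single Q value for this (s, a) pair
        )

    def forward(self, state, action_onehot):
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)

def max_q(model, state):
    """Evaluate Q(s, a) for every action and return the maximum."""
    actions = torch.eye(N_ACTIONS)                     # all one-hot actions
    states = state.unsqueeze(0).expand(N_ACTIONS, -1)  # repeat s for each action
    with torch.no_grad():
        q_values = model(states, actions)              # shape: (N_ACTIONS,)
    return q_values.max()
```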

The paper uses option 2. Note that this sets the TD error, and hence the gradient, for the unused actions to $0$, so they have no direct impact on the weight update. The effect is not very different from the first option: in both cases the updated approximation will likely also change the predicted returns for actions that were not taken.
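
The masking described above can be seen in a minimal sketch of option 2, again assuming a PyTorch model that maps a state vector to one Q value per action; the `dqn_update` helper and the hyperparameters are hypothetical, not taken from the paper. The target vector copies the network's own predictions and overwrites only the entry for the action taken.

```python
# Hypothetical sketch of approach 2: one output per action, masked target.
import torch
import torch.nn as nn

def dqn_update(model, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One gradient step for a single transition (s, a, r, s_next)."""
    q_pred = model(s.unsqueeze(0)).squeeze(0)          # shape: (n_actions,)

    with torch.no_grad():
        q_next = model(s_next.unsqueeze(0)).squeeze(0)
        td_target = r + gamma * q_next.max() * (1.0 - float(done))
        target = q_pred.detach().clone()               # "prediction was correct"...
        target[a] = td_target                          # ...except for the action taken

    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only `target[a]` differs from the prediction, the squared error on the other outputs is exactly zero, which is what setting their gradients to $0$ means in practice.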
