Solved – How to perform the deep Q-learning batch update step on a neural network with multiple outputs

deep learning, neural networks, q-learning, reinforcement learning

I am taking on deep Q-learning and I am stuck on one particular point. I have googled multiple deep Q-learning examples, but practically every tutorial uses the cart-pole game to present the algorithm, and that game does not run into the issue I am facing.

The original DeepMind paper by Volodymyr Mnih (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) states the algorithm as follows:

[Image: the DQN algorithm pseudo-code from the paper]

I do not understand the part where $y_j$ is set.

In my problem, my Q function is an ANN with 64 inputs, one hidden layer of 48 neurons, and 8 outputs. Each output represents an action (i.e. I assume there are 8 available actions).

How do I set $y_j$, though? Let's say I evaluate a state $s_{j+1}$ with my model Q and my output looks like this: $\begin{bmatrix} 0.2 & 0.4 & 0.2 & 0.1 & 0.02 & 0.02 & 0.02 & 0.1\end{bmatrix}$. Therefore $\text{max}_{a'}Q(s_{j+1},a';\theta) = 1$ (the first element of the vector is indexed with 0).

State $s_{j+1}$ is non-terminal and current observed reward $r_j$ is 1. How do I set $y_j$? It should be of length 8, am I right?

Best Answer

In my problem, my Q function is an ANN with 64 inputs, one hidden layer of 48 neurons, and 8 outputs. Each output represents an action (i.e. I assume there are 8 available actions).

How do I set $y_j$, though? Let's say I evaluate a state $s_{j+1}$ with my model Q and my output looks like this: $\begin{bmatrix} 0.2 & 0.4 & 0.2 & 0.1 & 0.02 & 0.02 & 0.02 & 0.1\end{bmatrix}$. Therefore $\text{max}_{a'}Q(s_{j+1},a';\theta) = 1$ (the first element of the vector is indexed with 0).

That seems wrong; are you confusing max with argmax? $\text{max}_{a'}Q(\phi_{j+1},a';\theta) = 0.4$ in your example, using the notation of storing $\phi_{j+1} = \phi(s_{j+1})$ to match the pseudo-code (although you can use $s$ and $\phi$ almost interchangeably here: $\phi_j$ is just the NN's input representation for $s_j$).

It looks like the document you are using also confuses max with argmax when setting $a_t$; in that case it should be an argmax.
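To make the distinction concrete, here is a quick check on the output vector from your example (NumPy is used purely for illustration):

```python
import numpy as np

# Q(phi_{j+1}, .) as given in the question
q_next = np.array([0.2, 0.4, 0.2, 0.1, 0.02, 0.02, 0.02, 0.1])

print(q_next.max())     # 0.4 : max_{a'} Q(phi_{j+1}, a'; theta), used when computing y_j
print(q_next.argmax())  # 1   : index of the greedy action, used when *selecting* a_t
```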

State $s_{j+1}$ is non-terminal and current observed reward $r_j$ is 1. How do I set $y_j$? It should be of length 8, am I right?

Your training vector needs to be of length 8, but it is only related to $y_j$; it is not the same thing. $y_j$ is the estimate for the target of a single $Q(\phi_j, a_j)$ value, whilst your network outputs 8 different action values $\begin{bmatrix} Q(\phi_j, a_0) & Q(\phi_j, a_1) & Q(\phi_j, a_2) & \dots & Q(\phi_j, a_7) \end{bmatrix}$. The Q-learning algorithm itself is not designed around the structure of neural networks (or any specific approximator). Instead you have to make the neural network training fit what Q-learning does.

$y_j = r_j + \gamma \, \text{max}_{a'}Q(\phi_{j+1},a';\theta) = 1 + 0.9 \times 0.4 = 1.36$, assuming $\gamma = 0.9$ and the non-terminal transition as you said. It is a single scalar value, and it relates to the target output for $a_j$ only. For the other 7 entries of your training vector you have no target values at all.
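As a concrete sketch, this is all the computation of $y_j$ amounts to (NumPy again, with $\gamma = 0.9$ assumed as above and the numbers taken from your example):

```python
import numpy as np

gamma = 0.9                 # discount factor (assumed, as above)
r_j = 1.0                   # observed reward from the question
terminal = False            # s_{j+1} is non-terminal
q_next = np.array([0.2, 0.4, 0.2, 0.1, 0.02, 0.02, 0.02, 0.1])  # Q(phi_{j+1}, .; theta)

# y_j = r_j                                    if phi_{j+1} is terminal
# y_j = r_j + gamma * max_a' Q(phi_{j+1}, a')  otherwise
y_j = r_j if terminal else r_j + gamma * q_next.max()
print(y_j)                  # approximately 1.36, a single scalar, not a length-8 vector
```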

So, you know nothing about the actions that were not taken. You have two basic choices (both are sketched in code after the list):

  • Alter the loss function or gradient calculations so that the only output that matters for training is the one for $a_j$. Assuming $a_j$ was 5, for instance, your training target could be $\begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 1.36 & 0 & 0\end{bmatrix}$. Whether or not you can do this depends on your NN framework, but this would be the most efficient approach.

  • Run the network forward for $Q(\phi_j, \cdot)$ to get the current output vector, and adjust just the $a_j$ output to equal $y_j$ for training. Again assuming $a_j$ was 5, and that the forward pass produced the outputs $\begin{bmatrix} 1.1 & 2.1 & 1.5 & 0.3 & 0.9 & 1.7 & 1.3 & 1.4\end{bmatrix}$, the training target would then be $\begin{bmatrix}1.1 & 2.1 & 1.5 & 0.3 & 0.9 & 1.36 & 1.3 & 1.4\end{bmatrix}$. This has the advantage of being simple to drop into standard NN frameworks while not messing around with custom loss or gradient functions.
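Here is a hedged PyTorch sketch of both options. The 64-48-8 network shape, the reward of 1 and $\gamma = 0.9$ come from the discussion above; the framework choice, batch size, optimiser, learning rate and the random transitions are made up purely for illustration.

```python
import torch
import torch.nn as nn

# 64 inputs -> 48 hidden units -> 8 action values, as described in the question
net = nn.Sequential(nn.Linear(64, 48), nn.ReLU(), nn.Linear(48, 8))
opt = torch.optim.SGD(net.parameters(), lr=1e-3)   # optimiser and learning rate are arbitrary
gamma = 0.9

# A fake minibatch of transitions (phi_j, a_j, r_j, phi_{j+1}, done), for illustration only
batch = 32
phi_j    = torch.randn(batch, 64)
a_j      = torch.randint(0, 8, (batch,))           # indices of the actions actually taken
r_j      = torch.ones(batch)
phi_next = torch.randn(batch, 64)
done     = torch.zeros(batch)                      # 0 = non-terminal, 1 = terminal

# y_j = r_j + gamma * max_a' Q(phi_{j+1}, a'), or just r_j for terminal transitions
with torch.no_grad():
    y_j = r_j + (1.0 - done) * gamma * net(phi_next).max(dim=1).values

# Option 1: only the output for a_j enters the loss (pick out Q(phi_j, a_j) with gather)
q_taken = net(phi_j).gather(1, a_j.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_taken, y_j)

# Option 2 (same gradients up to a constant scale from the MSE mean): build a full
# length-8 target vector from the network's own outputs and overwrite only the a_j
# entry with y_j, so the error and gradient are zero for the 7 actions not taken.
with torch.no_grad():
    target_vec = net(phi_j).clone()
    target_vec[torch.arange(batch), a_j] = y_j
loss_full = nn.functional.mse_loss(net(phi_j), target_vec)

# One gradient step using option 1's loss (using loss_full instead works the same way)
opt.zero_grad()
loss.backward()
opt.step()
```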
