Solved – Q-learning with Neural Network as function approximation

neural-networks, reinforcement-learning

I am trying to use a Neural Network in order to approximate the Q-value in Q-learning, as in Questions about Q-Learning using Neural Networks. As suggested in the first answer, I am using a linear activation function for the output layer, while still using the sigmoid activation function in the hidden layers (two of them, although I can change this later on). I am also using a single NN that returns one output, $Q(a)$, for each action, as advised.
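
For concreteness, here is a minimal sketch of such a network (a PyTorch-style sketch; the layer sizes are purely illustrative and not the ones I actually use, but cart-pole has four state variables and two actions):

```python
import torch
import torch.nn as nn

# Two sigmoid hidden layers and a linear output layer that returns one
# Q-value per action. Layer sizes here are illustrative assumptions.
class QNetwork(nn.Module):
    def __init__(self, n_inputs=4, n_hidden=16, n_actions=2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),
            nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden),
            nn.Sigmoid(),
            nn.Linear(n_hidden, n_actions),  # linear activation on the output
        )

    def forward(self, state):
        return self.layers(state)  # vector of Q(s, a) for all actions
```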

However, the algorithm is still diverging for the simple cart-pole balancing problem. So, I fear my Q-update is wrong. After the initialization, what I have done at each step is the following:

  • Calculate $Q_t(s_t)$ using forward propagation of the NN for all actions.
  • Select a new action, $a_t$, and land in a new state, $s_{t+1}$.
  • Calculate $Q_t(s_{t+1})$ using forward propagation of the NN for all actions.
  • Set the target Q-value as:
    $Q_{t+1}(s_t,a_t)=Q_t(s_t,a_t)+\alpha_t \left[r_{t+1}+\gamma \max_a Q(s_{t+1},a) - Q_t(s_t,a_t) \right]$
    only for the current action, $a_t$, whilst setting $Q_{t+1}(s_t,a)=Q_{t}(s_t,a)$ for the other actions (see the code sketch after this list). Note: I think this is where the problem lies.
  • Set the error vector to $\mathbf{e}=Q_\mathrm{target}-Q_t=Q_{t+1}-Q_t$
  • Backpropagate the error through the NN in order to update the weight matrices.
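
To make the steps above concrete, here is a sketch of one full update in the same PyTorch-style notation (the helper name `q_update`, the squared-error loss, and the hyperparameter values are illustrative, not my actual code); it follows my list literally, including the $\alpha_t$ term inside the target:

```python
import torch

def q_update(q_net, optimizer, s_t, a_t, r_next, s_next, alpha=0.1, gamma=0.99):
    # Forward propagation: Q_t(s_t, .) for all actions.
    q_t = q_net(s_t)
    with torch.no_grad():
        # Forward propagation: Q_t(s_{t+1}, .) for all actions.
        q_next = q_net(s_next)
        # Target vector: unchanged for every action except the one taken ...
        target = q_t.detach().clone()
        # ... which gets the update from step 4, with alpha inside the target.
        target[a_t] = q_t[a_t] + alpha * (r_next + gamma * q_next.max() - q_t[a_t])
    # Error vector e = Q_target - Q_t, backpropagated via a squared-error loss.
    loss = ((target - q_t) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```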

Could anyone please point out to me where I have gone wrong?

Besides, do you reckon I should also include a bias term in the input layer and the first hidden layer (i.e. for the sigmoid functions)? Will it make a difference?

Thank you very much in advance for your help. I can help clarify the question or share code if required.

Best Answer

Your target should be just

$r_{t+1}+\gamma \max_a Q(s_{t+1},a)$.

Note that your error term (which is correct) could then be rewritten as

$r_{t+1}+\gamma \max_a Q(s_{t+1},a) - Q_t(s_t,a_t),$

which is the term inside the brackets of the update formula. During learning, it gets multiplied by your NN learning rate and the other backpropagation terms, and is then added to the previous weights, just as in the $Q$ update formula.
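
In code terms, relative to the update sketch in the question (assuming a PyTorch-style setup like the one sketched there), the only change is the target line for the chosen action: drop the old $Q_t(s_t,a_t)$ and the extra $\alpha_t$, since the NN's learning rate already plays that role.

```python
# Corrected target for the chosen action: just the TD target.
target[a_t] = r_next + gamma * q_next.max()
```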