Solved – Questions about Q-Learning using Neural Networks

machine learning, neural networks, reinforcement learning

I have implemented Q-learning as described in this paper:

http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf

To approximate Q(S, A), I use a neural network with the following structure:

  • Activation: sigmoid
  • Inputs: the state features plus one extra input neuron for the action (all inputs scaled to 0–1)
  • Output: a single neuron giving the Q-value
  • N hidden layers of M neurons each
  • Exploration: take a random action when 0 < rand() < propExplore (i.e., epsilon-greedy)

At each learning iteration I compute a Q-target value using the standard Q-learning update,

QTarget = reward + gamma * max_a' Q(nextState, a')

and then calculate the error as

error = QTarget - LastQValueReturnedFromNN

and backpropagate this error through the neural network.
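For concreteness, here is a minimal numpy sketch of the setup described above (state features plus an action input, sigmoid units, a single Q-value output, and the TD error driving backpropagation). The layer sizes, learning rate, and discount factor are placeholders chosen for illustration, not values from the linked paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer for brevity: inputs = state features + 1 action input, one Q output.
n_state, n_hidden = 4, 16          # placeholder sizes
W1 = rng.normal(0.0, 0.1, (n_state + 1, n_hidden))
W2 = rng.normal(0.0, 0.1, (n_hidden, 1))
alpha, gamma = 0.1, 0.9            # placeholder learning rate and discount factor

def q_value(state, action):
    x = np.append(state, action)                # state and action, all scaled to [0, 1]
    h = sigmoid(x @ W1)
    return sigmoid(h @ W2)[0], (x, h)

def q_update(state, action, reward, next_state, actions):
    global W1, W2
    q, (x, h) = q_value(state, action)
    # Q-target: reward plus the discounted best Q-value in the next state.
    q_next = max(q_value(next_state, a)[0] for a in actions)
    q_target = reward + gamma * q_next
    error = q_target - q                        # error = QTarget - LastQValueReturnedFromNN
    # Backpropagate the error through both sigmoid layers.
    delta_out = error * q * (1.0 - q)
    delta_hid = (delta_out * W2[:, 0]) * h * (1.0 - h)
    W2[:, 0] += alpha * delta_out * h
    W1 += alpha * np.outer(x, delta_hid)
```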

Q1: Am I on the right track? I have seen some papers that implement an NN with one output neuron for each action.

Q2: My reward function returns a number between -1 and 1. Is it OK for rewards to lie in [-1, 1] when the output activation function is a sigmoid, whose range is (0, 1)?

Q3: From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training on XOR, it sometimes learns after 2k iterations, and sometimes it won't learn even after 40k–50k iterations.

Best Answer

Q1. You're definitely on the right track, but a few changes could help immensely. Some people use one output unit per action so that they only have to run their network once for action selection (with your architecture, you have to run the net once for each possible action). This shouldn't make a difference with regard to learning, though, and is only worth implementing if you're planning on scaling your model up significantly.
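To make the two styles concrete, here is a hedged sketch of both forms of epsilon-greedy action selection; `q_single` and `q_all` are hypothetical callables standing in for the two network variants, not anything from the question or the paper:

```python
import numpy as np

def select_action_single_output(q_single, state, actions, eps, rng):
    """Network maps (state, action) -> Q: one forward pass per candidate action."""
    if rng.random() < eps:                          # explore with probability eps
        return actions[rng.integers(len(actions))]
    return max(actions, key=lambda a: q_single(state, a))

def select_action_multi_output(q_all, state, n_actions, eps, rng):
    """Network maps state -> vector of Q-values: a single forward pass."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_all(state)))
```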

Q2. Generally, people use a linear activation function for the last layer of their neural network, especially for reinforcement learning. There are a variety of reasons for this, but the most pertinent is that a linear activation function allows you to represent the full range of real numbers as your output. Thus, even if you don't know the bounds on the rewards for your task, you're still guaranteed to be able to represent that range.
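As a small illustration (with arbitrary weights and sizes, not a prescription), here is a forward pass with a sigmoid hidden layer and a linear output unit, so the predicted Q-value can be any real number:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 0.1, (5, 16))    # input -> hidden
W2 = rng.normal(0.0, 0.1, (16, 1))    # hidden -> output

def q_forward(x):
    h = 1.0 / (1.0 + np.exp(-(x @ W1)))   # sigmoid hidden layer
    return (h @ W2)[0]                    # linear output: unbounded Q-value
```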

Q3. Unfortunately, the theoretical guarantees for combining neural networks (and non-linear function approximation more generally) with reinforcement learning are pretty much non-existent. There are a few fancier versions of reinforcement learning (mainly out of the Sutton lab) that can make the sorts of convergence claims you mention, but I've never really seen those algorithms applied 'in the wild'. The reason is that while great performance can't be promised, it is typically obtained in practice, given proper attention to hyper-parameters and initial conditions.

One final point that bears mentioning for neural networks in general: don't use sigmoid activation functions for networks with many hidden layers! They're cursed with the problem of 'vanishing gradients': the error signal hardly reaches the earlier layers (looking at the derivative of the function should make it clear why this is the case). Instead, try rectified linear units (ReLU) or 'softplus' units, as they generally exhibit much better performance in deep networks.
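A quick numeric illustration of the vanishing-gradient point: the sigmoid's derivative never exceeds 0.25, so each sigmoid layer's activation contributes a factor of at most 0.25 to the backpropagated signal, while ReLU and softplus keep that factor near 1 for positive inputs:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)

sig = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sig * (1.0 - sig)             # peaks at 0.25 (at x = 0)
d_relu = (x > 0).astype(float)            # 1 for positive inputs, 0 otherwise
d_softplus = 1.0 / (1.0 + np.exp(-x))     # derivative of log(1 + e^x) is the sigmoid

print(d_sigmoid.max())   # 0.25 -> ten sigmoid layers scale the signal by at most 0.25**10
print(d_relu.max())      # 1.0
print(d_softplus.max())  # ~0.99
```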

See this paper for a great implementation of neural networks trained with reinforcement learning:

Mnih, Volodymyr, et al. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).