Solved – Confused about Function Approximation for Q Learning

neural-networks, q-learning, reinforcement-learning

I am not sure that I understood function approximation for Q-learning. So basically, with FA we don't use tables anymore? Each state is now represented with features, and we multiply those features by weights to get a Q value.

So the weights are what we are trying to optimize with various functions. I have seen examples with linear functions but not with other kinds of functions.

My first question is: how do we choose features? Do we have to hard-code them, or is there a way to generate features?

Is Q-learning with neural networks function approximation?

Thank you, and sorry if these questions seem trivial.

Best Answer

My first question is: how do we choose features?

Carefully through intuition about the problem, and experimentation.

Do we have to hard code them, or is there a way to generate features?

There are a few standard schemes which often work well with linear approximations:

  • Discretisation

  • Tiling

  • Fourier basis functions

  • Radial basis functions (or other kernel-like functions)

It is true that a lot of problems can be solved by suitable choice or transformation of state and action representation plus a linear approximation function. However, it is not always practical. For instance, it would be very hard to figure out the correct basis functions for agents whose input requires computer vision.
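For instance, here is a minimal sketch of linear Q-learning over hand-crafted features, just to make the "features times weights" idea concrete. The feature function, dimensions, and hyperparameters are all invented for illustration; in a real problem you would build `features` from one of the schemes above.

```python
import numpy as np

N_ACTIONS = 4
N_FEATURES = 8

def features(state, action):
    # Hypothetical hand-crafted feature map phi(s, a). Here: a 2-dimensional
    # state copied into the slot for the chosen action (a crude one-hot scheme).
    # In practice you would use tiling, RBFs, etc. from the list above.
    phi = np.zeros(N_FEATURES)
    phi[action * 2:(action + 1) * 2] = state
    return phi

w = np.zeros(N_FEATURES)   # the weights we are optimising

def q_value(state, action):
    # Linear function approximation: Q(s, a) = w . phi(s, a)
    return np.dot(w, features(state, action))

def q_learning_update(state, action, reward, next_state, done,
                      alpha=0.01, gamma=0.99):
    # Semi-gradient Q-learning step on one transition (S, A, R, S')
    global w
    target = reward
    if not done:
        target += gamma * max(q_value(next_state, a) for a in range(N_ACTIONS))
    td_error = target - q_value(state, action)
    # For a linear approximator, grad_w Q(s, a) is just phi(s, a)
    w += alpha * td_error * features(state, action)
```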

Is Q-learning with neural networks function approximation?

Yes. It has the added benefit that you can use the raw observations as input, and discovering features that work is part of the learning process. It has the disadvantage that those features (as exhibited in the hidden layers of the network) are unlikely to be easy to interpret.

To understand how this works in practice*, you could take a look at how DQN uses convolutional neural networks to approximate Q functions. But in essence, instead of working with the TD Error, as in this example from Q-learning:

$$\delta = R + \gamma \text{max}_{a'}[Q(S', a')] - Q(S,A)$$

you work with the TD Target:

$$\nu = R + \gamma \text{max}_{a'}[Q(S', a')]$$

as the "supervised learning" label, and train the neural network to associate this value with the input state. You could take the same approach with gradient-based linear regression (calculate the TD Target and then train a standard linear regression), although the linear version written out with the gradients already resolved into a single-step update seems to be the more popular approach in e.g. Sutton & Barto.
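As a rough sketch (network size, optimiser, and the `n_state_dims`/`n_actions` placeholders are assumptions, not part of any specific DQN implementation), training on a single transition with Keras might look like:

```python
import numpy as np
import tensorflow as tf

n_state_dims, n_actions = 4, 2
gamma = 0.99

# A small network mapping a state to one Q value per action
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(n_state_dims,)),
    tf.keras.layers.Dense(n_actions),
])
model.compile(optimizer="adam", loss="mse")

def train_on_transition(state, action, reward, next_state, done):
    # TD Target: nu = R + gamma * max_a' Q(S', a')
    target = reward
    if not done:
        target += gamma * np.max(model.predict(next_state[None], verbose=0)[0])
    # Use current predictions as labels for the other actions, so that only
    # the taken action's output is moved towards the TD Target
    labels = model.predict(state[None], verbose=0)
    labels[0, action] = target
    model.fit(state[None], labels, epochs=1, verbose=0)
```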

In practice you will want to normalise input values, as you would with a supervised learning problem. Also, there are a few other implementation details needed for stability, such as experience replay and using two copies of the network (the second, periodically-synchronised copy acting as a "target" network).
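Continuing the sketch above (reusing its `model`; the buffer size, batch size, and sync schedule are just placeholders), experience replay and a target network might look roughly like this:

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

replay_buffer = deque(maxlen=50_000)   # experience replay: store past transitions

# Second copy of the network, updated only occasionally, used to compute targets
target_model = tf.keras.models.clone_model(model)
target_model.set_weights(model.get_weights())

def replay_train_step(batch_size=32, gamma=0.99):
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    # TD Targets computed with the slowly-changing target network for stability
    next_q = target_model.predict(next_states, verbose=0).max(axis=1)
    targets = rewards + gamma * next_q * (1.0 - dones)
    labels = model.predict(states, verbose=0)
    labels[np.arange(batch_size), actions] = targets
    model.fit(states, labels, epochs=1, verbose=0)

# Every few hundred environment steps, copy the online weights across:
# target_model.set_weights(model.get_weights())
```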


* It is also possible to backpropagate directly from the TD Error approach; essentially it is the same thing. In practice, a lot of examples use the TD Target because then it is possible to build e.g. a Keras model with an MSE loss and train it using the mini-batch updates already supported by the framework.
