Solved – How to fit weights into Q-values with linear function approximation

Tags: feature-selection, machine-learning, reinforcement-learning

In reinforcement learning, linear function approximation is often used when the state space is large, i.e. when lookup tables become infeasible.

The form of the $Q$-value with linear function approximation is given by

$$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a),$$

where $w_i$ are the weights, and $f_i$ are the features.

The features are predefined by the user. My question is, how are the weights assigned?

I have read/downloaded some lecture slides on $Q$-learning with function approximation. Most of them are followed by slides on linear regression. Since they are just slides, they tend to be incomplete, and I wonder what the connection between the two topics is.

Best Answer

Function approximation is basically a regression problem (in the general sense, i.e. as opposed to classification, where the target is discrete): one tries to learn a function mapping from the input (in your case the features $f(s,a)$) to a real-valued output $Q(s,a)$. Since we do not have a full table of all input/output values, but instead learn and estimate $Q(s,a)$ at the same time, the parameters (here: the weights $w$) cannot be calculated directly from the data. A common approach is to use gradient descent.
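To make the regression view concrete, here is a minimal sketch of a stochastic gradient-descent update for a linear model in Python (the function name, toy data and learning rate are illustrative assumptions, not part of the original answer):

```python
import numpy as np

def sgd_update(w, x, y_target, alpha=0.1):
    """One stochastic gradient-descent step on the squared error
    0.5 * (y_target - w.x)^2 for the linear model y = w.x."""
    prediction = w @ x               # linear model: dot product of weights and features
    error = y_target - prediction    # regression error
    return w + alpha * error * x     # gradient of the squared error w.r.t. w is -error * x

# toy usage: fit w so that w.x approximates 2*x1 + 3*x2
rng = np.random.default_rng(0)
w = rng.uniform(0, 1, size=2)        # random init in [0, 1], as in the algorithm below
for _ in range(1000):
    x = rng.uniform(-1, 1, size=2)
    y = 2 * x[0] + 3 * x[1]
    w = sgd_update(w, x, y)
print(w)                             # approximately [2, 3]
```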

Here is the general algorithm for learning $Q(s,a)$ with value function approximation:

  • Initialize the parameter vector $w = (w_1, w_2, \dots, w_n)$ randomly (e.g. in $[0,1]$)
  • For each episode:

    1. $s \leftarrow$ initial state of the episode
    2. $a \leftarrow$ action given by policy $\pi$ (recommended: $\epsilon$-greedy)
    3. Take action $a$, observe reward $r$ and next state $s'$
    4. $w \leftarrow w + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right) \vec\nabla_w Q(s,a)$
    5. $s \leftarrow s'$

    Repeat steps 2–5 until $s$ is terminal

where ...

  • $\alpha\in[0,1]$ is the learning rate
  • $\gamma\in[0,1]$ is the discount rate
  • $\max_{a'} Q(s',a')$ is the value of the best action $a'$ in the next state $s'$, i.e. the maximum of $Q(s',a')$ over all actions $a'$
  • $\vec\nabla_w Q(s,a)$ is the gradient of $Q(s,a)$ with respect to $w$. In your linear case, the gradient is simply the feature vector $(f_1(s,a),\dots,f_n(s,a))$
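Because $Q(s,a) = w^\top f(s,a)$ is linear in $w$, its gradient with respect to $w$ is exactly the feature vector. Here is a small sketch that checks this with finite differences; the feature function is an arbitrary illustrative choice, not from the original answer:

```python
import numpy as np

def features(s, a):
    """Illustrative hand-crafted features f(s, a); in practice these are domain-specific."""
    return np.array([1.0, s, a, s * a])

def q_value(w, s, a):
    """Linear Q-value: Q(s, a) = w . f(s, a)."""
    return w @ features(s, a)

# finite-difference check that the gradient of Q w.r.t. w equals f(s, a)
w = np.array([0.5, -0.2, 0.3, 0.1])
s, a, eps = 2.0, 1.0, 1e-6
numeric_grad = np.array([
    (q_value(w + eps * e, s, a) - q_value(w - eps * e, s, a)) / (2 * eps)
    for e in np.eye(len(w))
])
print(numeric_grad)      # matches features(s, a) = [1.0, 2.0, 1.0, 2.0]
print(features(s, a))
```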

The parameter/weight update in step 4 can be read as follows:

  • $r + \gamma \max_{a'} Q(s',a') - Q(s,a)$ is the error between the prediction $Q(s,a)$ and the "target" value for $Q(s,a)$, which is the reward $r$ obtained now plus the expected, discounted value of following the greedy policy afterwards, $\gamma \max_{a'} Q(s',a')$
  • So the parameter/weight vector is shifted in the direction of steepest ascent (given by the gradient $\vec\nabla_w Q(s,a)$) by an amount proportional to the measured error, scaled by the learning rate $\alpha$.
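Putting the pieces together, here is a minimal sketch of the whole procedure in Python. It assumes a Gym-style environment with `reset()` and `step(action)` returning `(next_state, reward, done)`; the environment interface, the feature function and the hyperparameter values are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def linear_q_learning(env, features, n_actions, n_features,
                      episodes=500, alpha=0.01, gamma=0.99, epsilon=0.1, seed=0):
    """Q-learning with linear function approximation: Q(s, a) = w . f(s, a)."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0, 1, size=n_features)           # random init in [0, 1]

    def q(s, a):
        return w @ features(s, a)

    for _ in range(episodes):
        s = env.reset()                              # 1. initial state of the episode
        done = False
        while not done:
            # 2. epsilon-greedy action selection
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax([q(s, b) for b in range(n_actions)]))

            # 3. take the action, observe reward and next state (assumed interface)
            s_next, r, done = env.step(a)

            # 4. TD target uses the max over actions in the next state (0 if terminal)
            target = r if done else r + gamma * max(q(s_next, b) for b in range(n_actions))
            td_error = target - q(s, a)
            w = w + alpha * td_error * features(s, a)   # gradient of Q w.r.t. w is f(s, a)

            # 5. move to the next state
            s = s_next
    return w
```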

Main Source:

Chapter 8, Value Approximation, of the (overall recommended) book Reinforcement Learning: An Introduction by Sutton and Barto (first edition). The general algorithm has been modified, as is commonly done, to compute $Q(s,a)$ instead of $V(s)$. I have also dropped the eligibility traces $e$ to focus on gradient descent, hence using only one-step backups.

More references