Solved – Understanding oscillating behaviour when using Q-learning on the cart-pole problem

linear model, q-learning, reinforcement learning

I am fairly new to RL and have been using OpenAI Gym to try implementing a few algorithms I've been learning about.

I've just been trying to get Q-learning working on the cart-pole environment using linear function approximation and I got the following behaviour:

[Graph: per-episode performance over training, which improves for a while and then oscillates rather than converging]

My first thought was that the learning rate was too high and the gradient-descent step was overshooting the optimal parameters, but when I reduced it, the effect was basically just to stretch this graph in the x direction (i.e. it took much longer to do better than 20, and then the oscillation started anyway).

My code (88 lines, just Python + NumPy) is here, but here is the basic idea of what I think I'm doing mathematically:

  • At each timestep $t$ we are in state $S_t$ and choose action $A_t$ according to parameters $\theta_t$.

  • The features I am using for the linear approximation of $Q(s, a)$ are denoted by $\phi(s, a)$.

  • There are two actions 0 and 1. We define $\phi(s, 0) = [s, 0, \cdots, 0]^T$ and $\phi(s, 1) = [0, \cdots, 0, s]^T$. I'm a bit suspicious about whether this is a good idea…

  • We can then define our linear approximation according to some parameters $\theta$ as $\hat{Q}(s,a,\theta)=\theta^T\phi(s,a)$.

  • We then do the following at each step (see the code sketch after this list):

    • Perform $A_t$ and observe new state $S_{t+1}$ and receive reward $R_{t+1}$.
    • Calculate $A^*_{t+1}=\text{argmax}_a\hat{Q}(S_{t+1}, a, \theta_t)$, the action we believe is best according to $\hat{Q}$.
    • Update $\theta$ using SGD:
      \begin{align*}
      \theta_{t+1}&=\theta_t+\alpha\left[R_{t+1}+\gamma\hat{Q}(S_{t+1}, A^*_{t+1}, \theta_t)-\hat{Q}(S_t,A_t, \theta_t)\right]\nabla_\theta\hat{Q}(S_t,A_t,\theta_t)\\
      &=\theta_t+\alpha\left[R_{t+1}+\gamma\hat{Q}(S_{t+1}, A^*_{t+1}, \theta_t)-\hat{Q}(S_t,A_t, \theta_t)\right]\phi(S_t, A_t)
      \end{align*}
    • Finally decide on whether the next action $A_{t+1}$ will be $A^*_{t+1}$ or some random action ($\epsilon$-greedy).
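In code, the update step I mean is roughly the following; the names (`phi`, `q_hat`, `alpha`, `gamma`) and the terminal-state handling via `done` are illustrative, not copied from my linked file:

```python
import numpy as np

def phi(s, a, n_actions=2):
    """Feature vector phi(s, a): the state fills the block corresponding to action a."""
    features = np.zeros(len(s) * n_actions)
    features[a * len(s):(a + 1) * len(s)] = s
    return features

def q_hat(s, a, theta):
    """Linear approximation Q(s, a) = theta^T phi(s, a)."""
    return theta @ phi(s, a)

def q_learning_step(theta, s_t, a_t, r_next, s_next, done, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning update of theta, as in the equations above."""
    if done:  # no bootstrap term from a terminal state
        target = r_next
    else:
        a_star = max(range(2), key=lambda a: q_hat(s_next, a, theta))
        target = r_next + gamma * q_hat(s_next, a_star, theta)
    td_error = target - q_hat(s_t, a_t, theta)
    return theta + alpha * td_error * phi(s_t, a_t)
```

(The `done` branch drops the bootstrap term $\gamma\hat{Q}(S_{t+1}, A^*_{t+1}, \theta_t)$ at the end of an episode, which the equations above gloss over.)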

My best guess is that something is wrong with my $\theta$ update rather than it just being a matter of tuning hyperparameters. This is partly because I've spent a while trying different hyperparameters, and also because when I tried modifying the code to use experience replay it didn't seem to get anywhere at all…
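For reference, the kind of experience-replay loop I mean is roughly the following, reusing `phi` and `q_hat` from the sketch above; the buffer and batch sizes here are placeholders rather than what my actual code used:

```python
import random
from collections import deque

buffer = deque(maxlen=10_000)  # stores (s, a, r, s_next, done) transitions
BATCH_SIZE = 32

def replay_update(theta, alpha=0.01, gamma=0.99):
    """Apply the same semi-gradient update to a random minibatch of stored transitions."""
    if len(buffer) < BATCH_SIZE:
        return theta
    for s, a, r, s_next, done in random.sample(list(buffer), BATCH_SIZE):
        target = r if done else r + gamma * max(q_hat(s_next, b, theta) for b in range(2))
        theta = theta + alpha * (target - q_hat(s, a, theta)) * phi(s, a)
    return theta
```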

Best Answer

I don't believe your features can work. Disregarding the encoding of the action, you are simply using a linear function of the state to learn the value. The state given by OpenAI Gym consists of positions and velocities (cart position, cart velocity, pole angle, and pole angular velocity). The ideal value function $V$ would be symmetric around 0 in the pole angle, which a linear function of the raw state cannot represent. Tile coding or RBF features would work for cart-pole.
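To make that concrete, here is one way Gaussian RBF features could be built by hand; the grid of centres, the width, and the assumption that the state has been roughly normalised first are illustrative choices, not a fixed recipe:

```python
import numpy as np

def rbf_features(state, centers, width=0.5):
    """Gaussian RBF activations: one feature per centre, decaying with distance to the state."""
    sq_dists = np.sum((centers - state) ** 2, axis=1)  # centers has shape (n_centers, 4)
    return np.exp(-sq_dists / (2 * width ** 2))

# A small grid of centres over the (roughly normalised) 4-dimensional cart-pole state:
# cart position, cart velocity, pole angle, pole angular velocity.
grid = np.linspace(-1.0, 1.0, 4)
centers = np.array(np.meshgrid(grid, grid, grid, grid)).T.reshape(-1, 4)  # 4^4 = 256 centres

phi_s = rbf_features(np.array([0.1, -0.2, 0.05, 0.3]), centers)
print(phi_s.shape)  # (256,)
```

The per-action block encoding from the question can then stay exactly as it is, with the raw state $s$ replaced by its RBF encoding; that gives the linear model enough flexibility to represent a value function that peaks at the upright, centred state.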
