TD(λ) and Eligibility Traces over a Continuous State-Action Space

machine-learning, q-learning, reinforcement-learning

I've been trying to get a feel for Q-learning and reinforcement learning in general by implementing the algorithms for simplified problems. I was able to get a Q-learning algorithm with TD($\lambda$) to work online using eligibility traces on a problem with a finite set of states and actions.
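
For reference, the tabular version I have in mind looks roughly like the following sketch (Watkins's Q($\lambda$) with accumulating traces; the names and hyperparameters are illustrative, not my exact code):

```python
import numpy as np

# Tabular Watkins's Q(lambda): one trace entry per (state, action) pair.
n_states, n_actions = 10, 4
alpha, gamma, lam = 0.1, 0.99, 0.9

Q = np.zeros((n_states, n_actions))   # action-value lookup table
E = np.zeros((n_states, n_actions))   # eligibility trace per (state, action)

def td_lambda_step(s, a, r, s_next, a_next):
    """One online update after observing (s, a, r, s_next) and selecting a_next."""
    a_star = np.argmax(Q[s_next])
    delta = r + gamma * Q[s_next, a_star] - Q[s, a]   # TD error
    E[s, a] += 1.0                                    # accumulate trace for the visited pair
    Q[:, :] += alpha * delta * E                      # every traced pair shares the TD error
    if a_next == a_star:
        E[:, :] *= gamma * lam                        # decay all traces
    else:
        E[:, :] = 0.0                                 # Watkins: cut traces after an exploratory action
```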

Now I'm trying to apply Q-learning to a problem with continuous states and actions. I know that the lookup table for Q can be replaced by a function approximator, but extending TD(0) to TD($\lambda$) in an online algorithm seems less trivial.

I've seen it mentioned that eligibility traces can be applied to the weights of the function approximator rather than to the state-action space. But I'm unclear on (1) how the calculations are equivalent, and (2) how this solves the problem, since the weight space is also continuous.
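
My current reading of the weight-space trace, for a linear approximator $Q(s, a; w) = w^\top \phi(s, a)$, is the sketch below (the feature map $\phi$ and all hyperparameters are illustrative). The trace is a vector with one entry per weight, updated with the gradient of $Q$, so its size is fixed by the number of weights rather than by the continuous state-action space:

```python
import numpy as np

# Semi-gradient TD(lambda) with a linear approximator Q(s,a;w) = w . phi(s,a).
# The trace e has the same shape as w, so its size is fixed by the number of
# weights, not by the (continuous) state-action space.
n_features = 100
alpha, gamma, lam = 0.01, 0.99, 0.9

w = np.zeros(n_features)   # approximator weights
e = np.zeros(n_features)   # eligibility trace over the weights

def td_lambda_step(phi_sa, r, phi_next_best):
    """phi_sa: features of the visited (s, a); phi_next_best: features of (s', argmax_a Q)."""
    delta = r + gamma * w @ phi_next_best - w @ phi_sa   # TD error
    e[:] = gamma * lam * e + phi_sa                      # grad_w Q(s,a;w) = phi(s,a)
    w[:] = w + alpha * delta * e                         # single vector update per step
```

In the tabular special case, $\phi$ is a one-hot indicator over (state, action) pairs and this reduces to the lookup-table traces above. Is that the equivalence being referred to?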

Additionally, updating according to the eligibility traces seems very inefficient in this context. If I were using a multi-layer perceptron as the function approximator, wouldn't I have to run multiple backpropagations at each time step for the different trace updates?
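
Or would a single backward pass per time step be enough? The backward pass gives $\nabla_w Q(s_t, a_t; w)$, which gets folded into one trace tensor per weight tensor, and the TD error then scales the whole trace. A sketch of what I mean (assuming PyTorch and a network that takes a concatenated state-action vector; all names are illustrative):

```python
import torch
import torch.nn as nn

# One backward pass per step: the gradient of Q(s_t, a_t) accumulates into the traces.
alpha, gamma, lam = 1e-3, 0.99, 0.9
n_state, n_action = 4, 2

qnet = nn.Sequential(nn.Linear(n_state + n_action, 64), nn.ReLU(), nn.Linear(64, 1))
traces = [torch.zeros_like(p) for p in qnet.parameters()]   # one trace tensor per weight tensor

def td_lambda_step(sa, r, sa_next_best):
    """sa, sa_next_best: concatenated state-action feature tensors."""
    with torch.no_grad():
        target = r + gamma * qnet(sa_next_best)      # bootstrapped target, no gradient needed
    q = qnet(sa)
    delta = (target - q).item()                      # TD error, treated as a constant

    qnet.zero_grad()
    q.backward()                                     # the single backprop of this time step
    with torch.no_grad():
        for p, e in zip(qnet.parameters(), traces):
            e.mul_(gamma * lam).add_(p.grad)         # e <- gamma*lam*e + dQ/dw
            p.add_(alpha * delta * e)                # w <- w + alpha*delta*e
```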

How are these problems dealt with in practice?

A couple of ideas I have that may or may not be legitimate:

  1. Keep the eligibility trace as a lookup table that is reset between episodes (enforce episodes even if they are artificial to the problem by terminating at some given time step?). Though this doesn't really solve the backprop issue unless the episodes are very small.

  2. Keep a running queue of the past $n$ visited state-actions and only apply the eligibility trace updates to those. With $\lambda = 0.9$, the contributions become negligible after roughly 20 steps; couldn't the rest just be thrown away while still maintaining a good approximation of the TD($\lambda$) return? (See the sketch after this list.)
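
For idea 2, what I have in mind is something like the following sketch for a linear approximator (the queue length and names are arbitrary): keep the last $n$ feature vectors in a queue and spread each TD error over them with geometrically decaying weights, which approximates the full trace since $(\gamma\lambda)^k$ shrinks quickly.

```python
from collections import deque
import numpy as np

# Truncated trace: only the last N visited feature vectors receive the TD error,
# weighted by (gamma*lam)^k.  Assumes a linear Q(s,a;w) = w . phi(s,a).
n_features, N = 100, 20
alpha, gamma, lam = 0.01, 0.99, 0.9

w = np.zeros(n_features)
recent = deque(maxlen=N)                 # most recent feature vectors, newest first

def truncated_td_lambda_step(phi_sa, r, phi_next_best):
    delta = r + gamma * w @ phi_next_best - w @ phi_sa   # TD error
    recent.appendleft(phi_sa)
    for k, phi in enumerate(recent):                     # k = 0 is the current step
        w[:] += alpha * delta * ((gamma * lam) ** k) * phi
```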

Any ideas/clues/hints would be helpful.

Best Answer

I am working on the same problem. I found the paper Replacing eligibility trace for action-value learning with function approximation, which explains this in detail but does not provide any pseudocode. In my view, though, it is better to use experience replay techniques instead of eligibility traces in the case of a continuous state space. If you still want to implement eligibility traces, you can use one function approximator for the eligibility trace and a separate one for the Q-function.
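
A minimal sketch of the experience replay alternative mentioned above, assuming some one-step TD(0) update routine `q_update` for your function approximator (the buffer size, batch size, and `q_update` itself are placeholders, not from the paper):

```python
import random
from collections import deque

# Experience replay: store transitions and repeatedly apply one-step TD(0)
# updates to random past samples instead of maintaining eligibility traces.
buffer = deque(maxlen=100_000)
batch_size = 32

def remember(s, a, r, s_next, done):
    buffer.append((s, a, r, s_next, done))

def replay(q_update):
    """q_update is a placeholder for whatever TD(0) update the approximator uses."""
    if len(buffer) < batch_size:
        return
    for s, a, r, s_next, done in random.sample(buffer, batch_size):
        q_update(s, a, r, s_next, done)
```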
