Solved – Updating q-values in q-learning

machine-learning, q-learning, reinforcement-learning

I have been reading up about Q-learning, and I have a question about how the value iteration is computed.

From my understanding, the algorithm updates the q-value of a state-action pair after that action has been taken. So let's say that action a moves the agent from state s to state s'. The q-value q(s,a) is then moved towards the reward received for that transition, plus the discounted maximum q-value across all actions possible from state s'.
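For concreteness, the standard one-step tabular Q-learning update (with learning rate $\alpha$ and discount factor $\gamma$) is usually written as:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$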

However, this seems inefficient to me, because q-values are only updated by looking one step ahead. For example, let's say that the only state that has a reward is state 100, and the agent starts in state 0. When it finally reaches state 100 by random exploration, it will then update the q-value for the state which the agent was in just before it entered state 100.

But this does not update the q-values for state 0. To update those, the agent will have to update the q-values for all the state-action pairs that form a "link" from state 100 back to state 0.

Instead, wouldn't it be possible to just "work backwards" from state 100, updating the q-values for all states from which state 100 can be reached in one step? Then, for each of these states, update the q-values of all the states that can reach them, and so on, until finally the q-values for state 0 are updated.
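To make the "work backwards" idea concrete, here is a minimal sketch (all names are hypothetical). Note that it assumes access to a model of the environment, in the form of a predecessors table, which plain Q-learning does not require:

```python
from collections import defaultdict, deque

def backward_sweep(Q, predecessors, goal_state, gamma=0.9, alpha=0.5):
    """Propagate value backwards from goal_state through known predecessors.

    predecessors[s] is assumed to be a list of (prev_state, action, reward)
    triples for which taking `action` in `prev_state` leads to state `s`.
    """
    queue = deque([goal_state])
    seen = {goal_state}
    while queue:
        s_next = queue.popleft()
        best_next = max(Q[s_next].values(), default=0.0)
        for s_prev, a, r in predecessors.get(s_next, []):
            # Apply the usual one-step backup to the predecessor pair.
            target = r + gamma * best_next
            Q[s_prev][a] += alpha * (target - Q[s_prev][a])
            if s_prev not in seen:
                seen.add(s_prev)
                queue.append(s_prev)
    return Q

# Example usage (hypothetical 3-state chain 0 -> 1 -> 2, reward on reaching 2)
Q = defaultdict(lambda: defaultdict(float))
predecessors = {2: [(1, "right", 1.0)], 1: [(0, "right", 0.0)]}
backward_sweep(Q, predecessors, goal_state=2)
```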

Otherwise, q-values are only ever updated for state-action pairs that are actually selected during exploration, which could take a very long time to converge.

Presumably I am either misunderstanding q-learning, or misunderstanding the practicalities of my proposed solution. Any help?

Best Answer

That's a good idea. What you are thinking of is called eligibility traces. They often improve convergence speed, but at the cost of extra computation and memory, as well as added implementation complexity. The Sutton and Barto book describes them in detail.
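For a rough picture, a tabular version of Watkins's Q(λ), one common way eligibility traces are combined with Q-learning, might look like the sketch below. The environment interface (env.reset(), env.step()), the shared action list, and the hyperparameters are assumptions made purely for illustration:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def watkins_q_lambda(env, actions, episodes=500,
                     alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """Tabular Watkins's Q(lambda) with replacing eligibility traces.

    Assumes a simple, hypothetical env interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), and the same action
    list available in every state.
    """
    Q = defaultdict(float)                      # Q[(state, action)]
    for _ in range(episodes):
        E = defaultdict(float)                  # eligibility traces e(s, a)
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            a_star = max(actions, key=lambda b: Q[(s_next, b)])  # greedy action

            # TD error uses the greedy action, as in one-step Q-learning.
            target = r + (0.0 if done else gamma * Q[(s_next, a_star)])
            delta = target - Q[(s, a)]

            E[(s, a)] = 1.0                     # replacing trace
            for key, e in list(E.items()):
                # Every recently visited pair shares in the update ...
                Q[key] += alpha * delta * e
                if a_next == a_star:
                    E[key] = gamma * lam * e    # ... and its trace decays,
                else:
                    E[key] = 0.0                # or is cut after an exploratory action.
            s, a = s_next, a_next
    return Q
```

The key point for your question is the inner loop over E: a single rewarding transition updates not just the last state-action pair, but every pair visited earlier in the episode, weighted by how recently it was visited.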
