Solved – Updating q-values in q-learning

machine-learning, q-learning, reinforcement-learning

I have been reading up about Q-learning, and I have a question about how the value iteration is computed.

From my understanding, the algorithm updates the q-value of a state-action pair after that action has been taken. So let's say that action a moves the agent from state s to state s'. The q-value q(s,a) is then moved towards the reward received for that transition, plus the discounted maximum q-value across all actions possible from state s'.
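For concreteness, the standard one-step tabular Q-learning update (with learning rate $\alpha$ and discount factor $\gamma$) is usually written as:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$$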

However, this seems inefficient to me, because q-values are only updated by looking one step ahead. For example, let's say that the only state that has a reward is state 100, and the agent starts in state 0. When it finally reaches state 100 by random exploration, it will then update the q-value for the state which the agent was in just before it entered state 100.

But this does not update the q-values for state 0. To update those, the agent will have to update the q-values for all the state-action pairs that form a "link" from state 100 back to state 0.

Instead, wouldn't it be possible to just "work backwards" from state 100, updating the q-values for all states from which state 100 can be reached in one step? Then, for each of these states, update the q-values of all the states that can reach them, and so on, until finally the q-values for state 0 are updated.
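To make the "work backwards" idea concrete, here is a minimal sketch (all names are hypothetical). Note that it assumes access to a model of the environment, in the form of a predecessors table, which plain Q-learning does not require:

```python
from collections import defaultdict, deque

def backward_sweep(Q, predecessors, goal_state, gamma=0.9, alpha=0.5):
    """Propagate value backwards from goal_state through known predecessors.

    predecessors[s] is assumed to be a list of (prev_state, action, reward)
    triples for which taking `action` in `prev_state` leads to state `s`.
    """
    queue = deque([goal_state])
    seen = {goal_state}
    while queue:
        s_next = queue.popleft()
        best_next = max(Q[s_next].values(), default=0.0)
        for s_prev, a, r in predecessors.get(s_next, []):
            # Apply the usual one-step backup to the predecessor pair.
            target = r + gamma * best_next
            Q[s_prev][a] += alpha * (target - Q[s_prev][a])
            if s_prev not in seen:
                seen.add(s_prev)
                queue.append(s_prev)
    return Q

# Example usage (hypothetical 3-state chain 0 -> 1 -> 2, reward on reaching 2)
Q = defaultdict(lambda: defaultdict(float))
predecessors = {2: [(1, "right", 1.0)], 1: [(0, "right", 0.0)]}
backward_sweep(Q, predecessors, goal_state=2)
```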

Otherwise, q-values are only ever updated for state-action pairs that are actually selected during exploration, which could take a very long time to converge.

Presumably I am either misunderstanding q-learning, or misunderstanding the practicalities of my proposed solution. Any help?

Best Answer

That's a good idea. What you are thinking of is called eligibility traces. They often improve convergence speed, but at the cost of extra computation and memory, as well as added implementation complexity. The Sutton and Barto book describes them in detail.
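For a rough picture, a tabular version of Watkins's Q(λ), one common way eligibility traces are combined with Q-learning, might look like the sketch below. The environment interface (env.reset(), env.step()), the shared action list, and the hyperparameters are assumptions made purely for illustration:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def watkins_q_lambda(env, actions, episodes=500,
                     alpha=0.1, gamma=0.99, lam=0.9, epsilon=0.1):
    """Tabular Watkins's Q(lambda) with replacing eligibility traces.

    Assumes a simple, hypothetical env interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done), and the same action
    list available in every state.
    """
    Q = defaultdict(float)                      # Q[(state, action)]
    for _ in range(episodes):
        E = defaultdict(float)                  # eligibility traces e(s, a)
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            a_star = max(actions, key=lambda b: Q[(s_next, b)])  # greedy action

            # TD error uses the greedy action, as in one-step Q-learning.
            target = r + (0.0 if done else gamma * Q[(s_next, a_star)])
            delta = target - Q[(s, a)]

            E[(s, a)] = 1.0                     # replacing trace
            for key, e in list(E.items()):
                # Every recently visited pair shares in the update ...
                Q[key] += alpha * delta * e
                if a_next == a_star:
                    E[key] = gamma * lam * e    # ... and its trace decays,
                else:
                    E[key] = 0.0                # or is cut after an exploratory action.
            s, a = s_next, a_next
    return Q
```

The key point for your question is the inner loop over E: a single rewarding transition updates not just the last state-action pair, but every pair visited earlier in the episode, weighted by how recently it was visited.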
