Solved – Convergence of Value Iteration in Reinforcement Learning

reinforcement learning

I am struggling to understand when the value iteration algorithm converges.

Suppose that I use a discount factor of 1.

Will the algorithm always converge if I have terminal states in the MDP, or does it depend on how I define the rewards?

I understand that if I have terminal states in the MDP, then the MDP is episodic, so it will end at some point. So, do I have to use a discount factor smaller than 1 to converge?

Best Answer

There are a few requirements for Value Iteration to guarantee convergence:

  • State space and action space should be finite

  • Reward values should have an upper and lower bound

  • The environment should be episodic, or, if continuing (non-episodic), the discount factor should be less than 1

  • The value function should be represented as a table, one entry per state, as in the sketch after this list. If you use function approximation over state vectors, then value iteration can be unstable, just like Q-learning.
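To make the tabular case concrete, here is a minimal sketch of value iteration on a hypothetical three-state chain MDP (the states, transitions, rewards and threshold are assumptions for illustration, not taken from your problem). With a discount factor of 1 it still converges, because every policy reaches the terminal state:

```python
# Minimal sketch of tabular value iteration on a hypothetical 3-state chain MDP.
# States 0 and 1 are non-terminal, state 2 is terminal. The transitions, rewards
# and theta threshold below are illustrative assumptions.
import numpy as np

n_states = 3
terminal = {2}
# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {"right": [(1.0, 1, -1.0)]},
    1: {"right": [(1.0, 2, 10.0)], "left": [(1.0, 0, -1.0)]},
}

gamma = 1.0   # undiscounted: safe here because every policy reaches the terminal state
theta = 1e-8  # stop when the largest value change in a sweep falls below this

V = np.zeros(n_states)
while True:
    delta = 0.0
    for s in range(n_states):
        if s in terminal:
            continue  # terminal states keep value 0
        # Bellman optimality backup: value of the best action from s
        best = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in transitions[s].values()
        )
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

print(V)  # converges to [9.0, 10.0, 0.0]
```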

I understand that if I have terminal states in the MDP, then the MDP is episodic, so it will end at some point.

That is not necessarily a strong enough guarantee. If loops between states are possible, where an agent could either get stuck accumulating negative reward, or could gain unbounded positive reward over time, then you may have a problem with episodes not ending.
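As a toy illustration (an assumed MDP, not yours): one non-terminal state with a self-loop that pays +1, plus an action that exits to the terminal state for 0. With a discount factor of 1 the backup keeps growing, so value iteration never converges even though a terminal state exists:

```python
# Assumed counterexample: the best action from the single non-terminal state is
# a self-loop paying +1; quitting to the terminal state pays 0. With gamma = 1
# the backup V <- max(1 + V, 0) grows forever, so the sweeps never converge.
V = 0.0
for sweep in range(5):
    V = max(1.0 + 1.0 * V,  # "stay": self-loop, reward +1, no discounting
            0.0)            # "quit": move to the terminal state, reward 0
    print(sweep, V)         # prints 1.0, 2.0, 3.0, ... and keeps growing
```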

So, do I have to use a discount factor smaller than 1 to converge?

For a strictly episodic problem, with no opt-in loops that grant positive reward and no "trap" loops that accumulate negative reward, no. Discounting is not required in that case.
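For contrast, on the same assumed self-loop example from above, any discount factor below 1 makes the backup a contraction, so the values settle at 1 / (1 - gamma) despite the loop:

```python
# Same assumed self-loop example, but with a discount factor below 1:
# the backup is now a contraction and the value converges to 1 / (1 - gamma).
gamma = 0.9
V = 0.0
for _ in range(200):
    V = max(1.0 + gamma * V, 0.0)
print(V)  # approximately 10.0 = 1 / (1 - 0.9)
```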