Solved – Why does value iteration converge to the optimal value function in a **finite** number of steps

machine learning, reinforcement learning

Recall the reinforcement learning notes (especially for notation) at:

http://cs229.stanford.edu/notes/cs229-notes12.pdf

Assume that we have a finite MDP. Why does value iteration converge to the optimal value function in a finite number of steps? Can we express this runtime asymptotically and precisely in terms of the problem variables (the number of states, etc.)?


Let me share with you my thoughts:

But first, recall value iteration:

$$ V(s) := R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')V(s') $$

(repeated for every state $s$, starting from $V(s) := 0$)

and recall Bellman equations:

$$ V^\pi(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s')V^\pi(s') \qquad\text{and}\qquad V^*(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')V^*(s') $$

If we accept that Bellman's optimality equation really does characterize the optimal value function, then it's obvious why value iteration reaches the optimal value function once it converges (if it converges at all). Once it converges, then:

$$ V^*(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')V^*(s')$$

should hold, so we are done! However, it's not entirely clear that the iteration ever gets there, i.e. whether the update rule actually reaches this fixed point (even in an infinite number of steps).
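One compact way to phrase this (my own framing, assuming the discounted setting $\gamma < 1$ from the CS229 notes) is to define the Bellman backup operator $T$ acting on value functions,

$$ (TV)(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s')V(s'), $$

so that value iteration computes $V_{k+1} = TV_k$, and the Bellman optimality equation above says exactly $V^* = TV^*$. The question then becomes: does the sequence $V_k$ actually approach a fixed point of $T$, and how fast?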

I was hoping that the following would be true (but I am not sure):

The update rule only ever increases $V(s)$.

If that were true, then it seems clear that $V(s)$ would eventually reach its maximum. However, even if that were true, it's not clear that it would get there in a finite number of steps, or even in the limit. It would be nice to see a proof of this; does anyone have one?
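As a quick sanity check on that hope (a toy example of my own, not from the notes): with negative rewards and the usual initialization $V_0 \equiv 0$, the update can only ever decrease $V(s)$, yet the iterates still converge. Take a single state $s$ with a single action, $P_{sa}(s) = 1$, $R(s) = -1$, $\gamma = 0.9$:

$$ V_0(s) = 0,\quad V_1(s) = -1,\quad V_2(s) = -1.9,\quad V_3(s) = -2.71,\quad \dots \;\to\; -10 = V^*(s). $$

So monotone increase does not seem to be the right invariant in general; presumably one has to bound the distance to $V^*$ directly instead.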

Every time I try to prove it, I get stuck on the same question: does the update rule actually make progress?

Intuitively, it is trying to force Bellman's equation to hold, but does it actually make it hold eventually? How long does that take? What's the runtime?
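For what it's worth, the kind of statement I would hope for (assuming the discounted setting $\gamma < 1$; this is the standard contraction-style bound, not something taken from the notes above) is

$$ \max_{s \in S} \left| V_k(s) - V^*(s) \right| \;\le\; \gamma^{k} \, \max_{s \in S} \left| V_0(s) - V^*(s) \right|, $$

which would give exact convergence only in the limit, but an $\epsilon$-accurate value function after roughly $\log(1/\epsilon)/\log(1/\gamma)$ sweeps, each sweep costing on the order of $|S|^2|A|$ operations.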


Also, the update can be implemented in different ways, so pick whichever you prefer. What I mean by that: the notes point out that the update can be performed either synchronously (compute the new values for all states, then overwrite them all at once) or asynchronously (loop over the states and update each one in place).
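To make the two variants concrete, here is a minimal sketch (my own, not from the notes) of value iteration on a hypothetical toy MDP, showing both the synchronous and the in-place (asynchronous) update:

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-8, synchronous=True, max_iters=10_000):
    """Toy value iteration sketch.

    R: reward vector, shape (S,)          -- R[s]
    P: transition tensor, shape (S, A, S) -- P[s, a, s'] = P_{sa}(s')
    Returns the estimated V and the number of sweeps performed.
    """
    S, A, _ = P.shape
    V = np.zeros(S)  # V(s) := 0 for all s
    for k in range(max_iters):
        if synchronous:
            # Compute every new value from the *old* V, then overwrite all at once.
            Q = R[:, None] + gamma * (P @ V)   # Q[s, a] = R(s) + gamma * sum_s' P_sa(s') V(s')
            V_new = Q.max(axis=1)
            delta = np.max(np.abs(V_new - V))
            V = V_new
        else:
            # Asynchronous / in-place: each state update immediately sees earlier updates.
            delta = 0.0
            for s in range(S):
                v_old = V[s]
                V[s] = R[s] + gamma * np.max(P[s] @ V)
                delta = max(delta, abs(V[s] - v_old))
        if delta < tol:
            return V, k + 1
    return V, max_iters

if __name__ == "__main__":
    # Hypothetical 3-state, 2-action MDP just to exercise the code.
    rng = np.random.default_rng(0)
    R = rng.uniform(-1, 1, size=3)
    P = rng.dirichlet(np.ones(3), size=(3, 2))  # each P[s, a, :] sums to 1
    for sync in (True, False):
        V, sweeps = value_iteration(R, P, gamma=0.9, synchronous=sync)
        print("synchronous" if sync else "in-place", sweeps, np.round(V, 4))
```

Both variants stop when the largest change in a sweep falls below `tol`; the in-place version typically needs no more sweeps than the synchronous one, since later states already see the updated values.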

Best Answer

According to Section VI of *Reinforcement Learning: An Introduction* by Sutton and Barto, this is an empirical conclusion, not a strictly proved theorem. This implies they believe that, for some special MDP models, the convergence might not happen. Please correct me if my understanding is wrong.
