Solved – Uniqueness of the optimal value function for an MDP

markov-decision-process, reinforcement-learning

Suppose we have a Markov decision process with a finite state set and a finite action set, and we evaluate policies by the expected discounted return with discount factor $\gamma \in [0,1]$.
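For reference, in the book's standard notation (this is a restatement of the definitions, not a quote), the value of state $s$ under policy $\pi$ and the meaning of "optimal policy" are:

$$
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s\right],
\qquad
\pi_* \text{ is optimal} \iff V^{\pi_*}(s) \ge V^{\pi}(s) \ \text{ for all } \pi \text{ and all } s.
$$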

In Section 3.8 of the book "Reinforcement Learning: An Introduction" (by Richard S. Sutton and Andrew G. Barto) it is stated that there always exists at least one optimal policy, but no proof is given.

I suppose that all optimal policies yield the same optimal value function; at least, that is what would make sense, and it is also what the book assumes.

Can someone give me a proof of this statement, or a link to one?

Best Answer

Assume that there exist two optimal policies $\pi$ and $\pi'$ with respective value functions $V$ and $V'$, and that $V(x) \neq V'(x)$ for some state $x$. Without loss of generality, assume $V(x) < V'(x)$. By definition, a policy is optimal only if its value function is at least as large as that of every other policy in every state. But $V(x) < V'(x)$ means $\pi$ is strictly worse than $\pi'$ at $x$, so $\pi$ is not optimal, a contradiction. Therefore $V(x) = V'(x)$ for every state $x$: all optimal policies share the same value function, usually denoted $V^*$.
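Since the state and action sets are finite, the claim can also be checked by brute force. Below is a minimal Python sketch; the particular transition tensor `P`, reward table `R`, and discount `gamma` are made-up assumptions for illustration, and state 0 is deliberately built so that its two actions are interchangeable, guaranteeing at least two distinct optimal policies. The script evaluates every deterministic policy exactly and confirms that all policies achieving the statewise-maximal value share one value function.

```python
import numpy as np
from itertools import product

# Hypothetical toy MDP (all numbers are assumptions chosen for illustration):
# 2 states, 2 actions. In state 0 both actions behave identically, so at
# least two distinct optimal policies exist.
n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a] = transition distribution over next states, R[s, a] = expected reward.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.9, 0.1], [0.1, 0.9]]])
R = np.array([[1.0, 1.0],
              [0.0, 2.0]])

def evaluate(policy):
    """Exact policy evaluation: solve the linear system V = R_pi + gamma * P_pi V."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    R_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Evaluate every deterministic policy (a tuple mapping state -> action).
values = {pi: evaluate(pi) for pi in product(range(n_actions), repeat=n_states)}

# A policy is optimal iff its value function equals the statewise maximum.
V_star = np.max(np.array(list(values.values())), axis=0)
optimal = [pi for pi, V in values.items() if np.allclose(V, V_star)]

print("V* =", V_star)
for pi in optimal:
    print("optimal policy", pi, "has value", values[pi])
# Both optimal policies print the same value function, as the proof predicts.
```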
