Why is the optimal policy in a Markov Decision Process (MDP) independent of the initial state?

machine learning, reinforcement learning

I was following the CS229 lecture notes on reinforcement learning (which can be consulted for the notation I am using in this question):

http://cs229.stanford.edu/notes/cs229-notes12.pdf

and I had a question about the following paragraph:

[Screenshot from the notes: the paragraph stating that there is a single policy $\pi^*$ that is optimal for every state.]

My specific question is: why does $\pi^*$ have the property that it is the optimal policy for all states? I guess it isn't proved in the notes because it is obvious to the author, but why is it true? Is there a proof of it somewhere, or does someone know at least an intuitive argument for it?

I am just very surprised, because it seems very counterintuitive, especially given the way the optimal value function is defined on page 4.

The notes define the optimal value function as

$$V^*(s) = \max_{\pi} \ V^{\pi}(s)$$

The way I understand it, this is the best possible expected sum of discounted rewards that can be attained using any policy. However, it seems to be a function of $s$, and for each $s$ we maximize over $\pi$. So how come we don't end up with a different optimal $\pi$ for each state?
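
To make the puzzle concrete, here is a minimal sketch of the kind of check one can run (my own toy example, not from the notes; the transition tensor `P`, reward vector `R`, and discount `gamma` are made up): enumerate every deterministic policy of a small MDP, compute each one's value function exactly, and see which policy wins at each state separately. Empirically the per-state winners all coincide, which is exactly the claim being asked about.

```python
# A brute-force check on a tiny toy MDP: enumerate every deterministic policy,
# compute each policy's value function exactly, and ask which policy is best at
# each individual state. The numbers below are arbitrary, just for illustration.
import itertools
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

P = rng.random((n_states, n_actions, n_states))   # P[s, a, s'] = P_{sa}(s')
P /= P.sum(axis=2, keepdims=True)                 # normalize rows to probabilities
R = rng.random(n_states)                          # R[s] = reward of state s

def policy_value(pi):
    """Exact V^pi, from the linear system V = R + gamma * P_pi V."""
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

policies = list(itertools.product(range(n_actions), repeat=n_states))
values = np.array([policy_value(pi) for pi in policies])  # row i holds V^{pi_i}

# Index of the policy attaining the largest value at each state, state by state:
print(values.argmax(axis=0))
# All entries coincide (barring exact ties): one policy is best at every state at once.
```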


For more notation relevant to my question, read below (or see pages 1 through 4 of the notes I linked):

Recall what the value function is:

$$V^{\pi}(s) = E[R(s_0) +\gamma R(s_1) + \gamma^2R(s_2) + \cdots \mid s_0 = s ; \pi]$$

which is the expected sum of discounted rewards upon starting in state $s$ and taking actions according to the given policy $\pi$ (note that $\pi$ is not a random variable but a "fixed" parameter mapping states to actions).
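
As a sanity check on this definition, here is a hedged sketch (again a made-up toy MDP, not anything from the notes) that estimates $V^{\pi}(s)$ directly from the expectation above: roll out many truncated trajectories starting from $s$ under a fixed policy $\pi$ and average the discounted reward sums.

```python
# Monte Carlo estimate of V^pi(s) straight from the definition: average the
# discounted reward sums of many truncated rollouts that start in state s and
# follow a fixed policy pi. The MDP below is an arbitrary toy example.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.random((n_states, n_actions, n_states))   # P[s, a, s'] = P_{sa}(s')
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)                          # R[s] = reward of state s
pi = [0, 1, 0]                                    # an arbitrary fixed policy

def mc_value(s, n_rollouts=5_000, horizon=60):
    total = 0.0
    for _ in range(n_rollouts):
        state, discount, ret = s, 1.0, 0.0
        for _ in range(horizon):                  # truncate the infinite sum
            ret += discount * R[state]
            state = rng.choice(n_states, p=P[state, pi[state]])
            discount *= gamma
        total += ret
    return total / n_rollouts

print("Monte Carlo estimate of V^pi(0):", mc_value(0))
```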

On page 4 of the CS229 notes, the following quantities are defined:

Thus, we can rewrite Bellman's equations with this "best" value function:

$$V^*(s) = R(s) + \max_{a \in A} \ \gamma\sum_{s' \in S} P_{sa}(s')V^*(s')$$

which says that the best value for state $s$ is the immediate reward plus the discounted payoff from the action that maximizes our expected future value, i.e. the reward of doing the best thing now so as to get the best payoff in the future.
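
This fixed-point equation is exactly what value iteration exploits: start from any $V$ and apply the right-hand side as an update until it stops changing. A minimal sketch, again on a made-up toy MDP (the notes give no code, so `P`, `R`, and `gamma` here are my own placeholders):

```python
# Value iteration: repeat V(s) <- R(s) + max_a gamma * sum_{s'} P_sa(s') V(s')
# until the update no longer changes V. The toy MDP here is arbitrary.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.random((n_states, n_actions, n_states))   # P[s, a, s'] = P_{sa}(s')
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)

V = np.zeros(n_states)
for _ in range(10_000):
    Q = gamma * (P @ V)            # Q[s, a] = gamma * sum_{s'} P[s, a, s'] * V[s']
    V_new = R + Q.max(axis=1)      # Bellman optimality update
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print("V* ≈", V)
```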

From that we see that we can get the best policy by "extracting" the best action for each state according to the equation above:

$$ \pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')V^*(s') $$
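
Given (an estimate of) $V^*$, extracting $\pi^*$ is then just one argmax per state. A small sketch, assuming the same `P[s, a, s']` transition-tensor convention and a `V_star` array such as the output of the value-iteration sketch above:

```python
# Greedy policy extraction: pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s').
import numpy as np

def extract_policy(P, V_star):
    # (P @ V_star)[s, a] = sum_{s'} P[s, a, s'] * V_star[s']
    return np.argmax(P @ V_star, axis=1)

# Example usage (with P and V from the value-iteration sketch above):
#   pi_star = extract_policy(P, V)
#   print(pi_star)        # one action index per state
```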

Then it states that for every state $s$ and every policy $\pi$ we have:

$$ V^*(s) = V^{\pi^*}(s) \geq V^{\pi}(s)$$

Best Answer

The intuition behind the argument that the optimal policy is independent of the initial state is the following:

The optimal policy is defined by a function that selects an action for every possible state, and the actions chosen in different states are independent of each other.

Formally speaking, for an unknown initial distribution, the value function to maximize would be the following (not conditioned on the initial state):

$$v^\pi = E[\,R(s_0,a_0) + \gamma R(s_1,a_1) + \cdots \mid \pi\,]$$

Thus the optimal policy is the policy that maximizes $v^\pi = x_0^T V^\pi$, where $x_0$ is the vector defined by $x_0(s) = \Pr[s_0 = s]$ and $V^\pi$ is the vector whose entries $V^\pi(s)$ are defined the same way as in your question. The optimal policy $\pi^*$ is therefore a solution of the following optimization:

$$\max_{\pi}\ x_0^T V^\pi \;=\; \max_{\pi} \sum_{s} x_0(s)\,V^{\pi}(s) \;=\; \sum_{s} x_0(s)\, \max_{\pi} V^{\pi}(s) \;=\; \sum_{s} x_0(s)\,V^{*}(s).$$

The third equality is the key step. It holds because $x_0(s) \geq 0$ and because a policy chooses an action $a \in A_s$ at each state independently of the choices made at other states (here $A_s$ is the set of actions available at state $s$), so the maximization over policies decomposes into a separate maximization at every state, and a single policy can attain all the per-state maxima simultaneously. In other words, if there were constraints, for example on $x_t$ (e.g., if the optimization only searched among policies that guarantee $x_t \leq d$ for $t > 0$), then the argument would no longer be valid and the optimal policy would be a function of the initial distribution.
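
As a quick numerical illustration of this interchange argument (again a made-up toy MDP with state-based rewards, as in the question's notation, not an example from the cited paper): brute-force the maximizer of $x_0^T V^\pi$ for several random initial distributions $x_0$ and observe that the same policy wins every time.

```python
# Check that argmax_pi x0^T V^pi does not depend on the initial distribution x0,
# by brute force over all deterministic policies of an arbitrary toy MDP.
import itertools
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(3)
P = rng.random((n_states, n_actions, n_states))   # P[s, a, s'] = P_{sa}(s')
P /= P.sum(axis=2, keepdims=True)
R = rng.random(n_states)                          # state-based rewards R[s]

def policy_value(pi):
    P_pi = np.array([P[s, pi[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)   # exact V^pi

policies = list(itertools.product(range(n_actions), repeat=n_states))
values = np.array([policy_value(pi) for pi in policies])         # row i = V^{pi_i}

for _ in range(5):
    x0 = rng.dirichlet(np.ones(n_states))         # a random initial distribution
    best = policies[np.argmax(values @ x0)]       # argmax_pi of x0^T V^pi
    print("x0 =", np.round(x0, 2), "-> best policy:", best)
# The printed policy is the same for every x0, as the argument above predicts.
```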

For more details, you can check the following arXiv report: "Finite-Horizon Markov Decision Processes with State Constraints", by Mahmoud El Chamie, Behcet Acikmese
