I'm studying reinforcement learning from Prof. Andrew Ng's lecture notes. There, the Bellman equation is stated as follows:
$$V(s) = R(s) + \gamma\max_a\sum_{s'\in S}P(s'|s,a)V(s')$$
Note that in the above equation, the state transition probabilities $P(s'|s,a)$ are not multiplied by the reward function $R(s)$ (the immediate reward).
Now in other references like this and this, the same Bellman equation is given in another form:
$$V(s) = \max_a \sum_{s'\in S} P(s'|s,a)[ R(s) + \gamma V(s')]$$
So what is the intuitive difference between the two equations above? The first equation makes complete sense to me as explained in Andrew Ng's notes, but I don't understand why, in the second form, the transition probabilities are multiplied by the reward function $R(s)$.
Best Answer
They both are the same.
Start from the second equation, using the fact that $\sum_{s'\in S} P(s'|s,a) = 1$ and that $R(s)$ does not depend on $a$ or $s'$: \begin{align*} V(s) &= \max_a \sum_{s'\in S} P(s'|s,a)\left[ R(s) + \gamma V(s')\right] \\ &=\max_a \left( \sum_{s'\in S} P(s'|s,a) R(s) + \gamma \sum_{s' \in S}P(s'|s,a) V(s') \right) \\ &=\max_a \left( R(s) \sum_{s'\in S} P(s'|s,a) + \gamma \sum_{s' \in S}P(s'|s,a) V(s') \right) \\ &=\max_a \left( R(s) \cdot 1 + \gamma \sum_{s' \in S}P(s'|s,a) V(s') \right) \\ &= R(s) + \gamma\max_a\sum_{s'\in S}P(s'|s,a)V(s'), \end{align*} which is exactly the right-hand side of the first equation. The last step pulls $R(s)$ outside the $\max_a$ because it is the same constant for every action.
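You can also check the equivalence numerically. Here is a quick sketch with a made-up two-state, two-action MDP (all numbers are hypothetical, chosen only for illustration), evaluating both forms of the Bellman backup on the same arbitrary value estimate:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (made-up numbers for illustration).
# P[a, s, s'] = P(s'|s,a); each row sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions under action 0
    [[0.5, 0.5], [0.3, 0.7]],   # transitions under action 1
])
R = np.array([1.0, -0.5])       # R(s): reward depends only on the state
gamma = 0.9
V = np.array([2.0, 3.0])        # arbitrary current value estimate

# Form 1: V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
# P @ V has shape (actions, states): expected next value per (a, s).
form1 = R + gamma * np.max(P @ V, axis=0)

# Form 2: V(s) = max_a sum_s' P(s'|s,a) [R(s) + gamma V(s')]
form2 = np.array([
    max(sum(P[a, s, sp] * (R[s] + gamma * V[sp]) for sp in range(2))
        for a in range(2))
    for s in range(2)
])

print(np.allclose(form1, form2))  # True: the two forms agree
```

Since $R(s)$ depends only on the current state, distributing it inside the sum over $s'$ changes nothing; the equivalence would break only if the reward depended on $s'$ (i.e. $R(s,a,s')$), in which case only the second form is correct.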