I'm studying reinforcement learning from Prof. Andrew Ng's lecture notes. There, the Bellman equation is stated as follows:
$$V(s) = R(s) + \gamma\max_a\sum_{s'\in S}P(s'|s,a)V(s')$$
Note that in the above equation, the state transition probabilities $P(s'|s,a)$ are not multiplied by the reward function $R(s)$ (the immediate reward).
Now in other references like this and this, the same Bellman equation is given in another form:
$$V(s) = \max_a \sum_{s'\in S} P(s'|s,a)[ R(s) + \gamma V(s')]$$
So what is the intuitive difference between the two equations above? The first equation makes complete sense to me as explained in Andrew Ng's notes, but I don't understand why, in the second form, the transition probabilities are multiplied by the reward function $R(s)$.
Best Answer
They both are the same.
Start from the second equation, using the fact that $\sum_{s'\in S} P(s'|s,a) = 1$ and that $R(s)$ does not depend on $a$ or $s'$: \begin{align*} V(s) &= \max_a \sum_{s'\in S} P(s'|s,a)\left[ R(s) + \gamma V(s')\right] \\ &=\max_a \left( \sum_{s'\in S} P(s'|s,a) R(s) + \gamma \sum_{s' \in S}P(s'|s,a) V(s') \right) \\ &=\max_a \left( R(s) \sum_{s'\in S} P(s'|s,a) + \gamma \sum_{s' \in S}P(s'|s,a) V(s') \right) \\ &=\max_a \left( R(s) \cdot 1 + \gamma \sum_{s' \in S}P(s'|s,a) V(s') \right) \\ &= R(s) + \gamma\max_a\sum_{s'\in S}P(s'|s,a)V(s'), \end{align*} which is exactly the right-hand side of the first equation. The last step pulls $R(s)$ outside the $\max_a$ because it is the same constant for every action.
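You can also check the equivalence numerically. Here is a quick sketch with a made-up two-state, two-action MDP (all numbers are hypothetical, chosen only for illustration), evaluating both forms of the Bellman backup on the same arbitrary value estimate:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (made-up numbers for illustration).
# P[a, s, s'] = P(s'|s,a); each row sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions under action 0
    [[0.5, 0.5], [0.3, 0.7]],   # transitions under action 1
])
R = np.array([1.0, -0.5])       # R(s): reward depends only on the state
gamma = 0.9
V = np.array([2.0, 3.0])        # arbitrary current value estimate

# Form 1: V(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
# P @ V has shape (actions, states): expected next value per (a, s).
form1 = R + gamma * np.max(P @ V, axis=0)

# Form 2: V(s) = max_a sum_s' P(s'|s,a) [R(s) + gamma V(s')]
form2 = np.array([
    max(sum(P[a, s, sp] * (R[s] + gamma * V[sp]) for sp in range(2))
        for a in range(2))
    for s in range(2)
])

print(np.allclose(form1, form2))  # True: the two forms agree
```

Since $R(s)$ depends only on the current state, distributing it inside the sum over $s'$ changes nothing; the equivalence would break only if the reward depended on $s'$ (i.e. $R(s,a,s')$), in which case only the second form is correct.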