Derivation of Bellman equation for state value function V(s)

Tags: expected value, linear algebra, machine learning

I'm studying reinforcement learning from Richard S. Sutton's book, where the derivation of the Bellman equation is given as follows:

$$v_\pi(s) = E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s)$$
$$=\sum_a \pi(a|s)\sum_{s'}\sum_r p(s', r|s, a)\left[r + \gamma E_\pi(G_{t+1}|S_{t+1} = s')\right]$$

I don't understand how this equation follows from the first one. Here is another resource that covers the same material, but in both places the same equation is derived without further explanation. So what am I missing here?

Best Answer

Suppose we follow policy $\pi$ and are currently in state $s$.

Then with probability $\pi(a|s)$, we take action $a$.

Given that we are in state $s$ and take action $a$, with probability $p(s',r|s,a)$ we receive reward $r$ and arrive at state $s'$.

That is, by the law of total expectation,

\begin{align}
&E_\pi (R_{t+1}+\gamma G_{t+1}|S_t=s) \\
&= \sum_{r,s',a} E_\pi (R_{t+1}+\gamma G_{t+1}|R_{t+1}=r, S_{t+1}=s', A=a, S_t=s)\,Pr(R_{t+1}=r, S_{t+1}=s', A=a|S_t=s) \\
&= \sum_{r,s',a} Pr(R_{t+1}=r, S_{t+1}=s', A=a|S_t=s)\,E_\pi (R_{t+1}+\gamma G_{t+1}|R_{t+1}=r, S_{t+1}=s', A=a, S_t=s) \\
&= \sum_{r,s',a} Pr(A=a|S_t=s)\,Pr(R_{t+1}=r, S_{t+1}=s'|A=a, S_t=s)\,E_\pi (r+\gamma G_{t+1}|S_{t+1}=s', A=a, S_t=s) \\
&= \sum_{r,s',a} \pi(a|s)\,p(s',r|s,a)\,E_\pi (r+\gamma G_{t+1}|S_{t+1}=s') \\
&= \sum_{r,s',a} \pi(a|s)\,p(s',r|s,a)\,[r+\gamma E_\pi (G_{t+1}|S_{t+1}=s')]
\end{align}

The key step is dropping the conditioning on $A=a$ and $S_t=s$ inside the expectation: once $R_{t+1}=r$ is known it can be pulled out as a constant, and by the Markov property, once $S_{t+1}=s'$ is known, $G_{t+1}$ does not depend on the earlier state and action. Since $E_\pi(G_{t+1}|S_{t+1}=s')=v_\pi(s')$, the final line is exactly the Bellman equation quoted from the book.
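If it helps to see this identity numerically, here is a minimal Python sketch on a made-up two-state, two-action MDP (the transition probabilities, rewards, policy, and discount below are invented for illustration; they are not from the book). It estimates $E_\pi(R_{t+1}+\gamma G_{t+1}|S_t=s)$ by Monte Carlo rollouts and compares it with the right-hand side $\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r+\gamma v_\pi(s')]$, where $v_\pi$ is obtained by solving the linear system $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A hypothetical 2-state, 2-action MDP (illustrative numbers only).
# p[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
p = {
    0: {0: [(0.7, 0, 1.0), (0.3, 1, 0.0)],
        1: [(1.0, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(0.5, 0, 5.0), (0.5, 1, -1.0)]},
}
# A fixed stochastic policy pi(a | s).
pi = {0: [0.6, 0.4], 1: [0.3, 0.7]}

def mc_value(s0, n_episodes=3000, horizon=100):
    """Monte Carlo estimate of E_pi[R_{t+1} + gamma*G_{t+1} | S_t = s0]."""
    returns = []
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):            # truncate: gamma**100 is negligible
            a = rng.choice(2, p=pi[s])
            probs, nxt, rew = zip(*p[s][a])
            i = rng.choice(len(probs), p=probs)
            g += discount * rew[i]
            discount *= gamma
            s = nxt[i]
        returns.append(g)
    return np.mean(returns)

def bellman_rhs(s, v):
    """Right-hand side: sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma*v(s')]."""
    return sum(pi[s][a] * sum(prob * (r + gamma * v[s2])
                              for prob, s2, r in p[s][a])
               for a in (0, 1))

# Solve v = r_pi + gamma * P_pi v exactly, then compare the three quantities.
P, r_pi = np.zeros((2, 2)), np.zeros(2)
for s in (0, 1):
    for a in (0, 1):
        for prob, s2, r in p[s][a]:
            P[s, s2] += pi[s][a] * prob
            r_pi[s] += pi[s][a] * prob * r
v = np.linalg.solve(np.eye(2) - gamma * P, r_pi)

for s in (0, 1):
    print(f"s={s}: v={v[s]:.3f}  Bellman RHS={bellman_rhs(s, v):.3f}  MC={mc_value(s):.3f}")
```

With enough episodes the Monte Carlo column matches the other two up to sampling noise, which is the law of total expectation above at work.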