Derivation of Bellman equation for state value function V(s)

Tags: expected value, linear algebra, machine learning

I'm studying reinforcement learning from Richard S. Sutton's book, where the derivation of the Bellman equation is given as follows:

$$v_\pi(s) = E_\pi(R_{t+1} + \gamma G_{t+1} | S_t = s)$$
$$=\sum_a \pi(a|s)\sum_{s'}\sum_r p(s', r|s, a)\left[r + \gamma E_\pi(G_{t+1}|S_{t+1} = s')\right]$$

I don't understand how this equation follows from the first one. Here is another resource that covers the same material, but in both places the same equation is derived without further explanation. So what am I missing here?

Best Answer

Suppose we follow policy $\pi$ and are currently in state $s$.

Then with probability $\pi(a|s)$, we take action $a$.

Given that we are in state $s$ and take action $a$, with probability $p(s',r|s,a)$ we receive reward $r$ and arrive at state $s'$.

That is, by the law of total expectation,

\begin{align}
&E_\pi (R_{t+1}+\gamma G_{t+1}|S_t=s) \\
&= \sum_{r,s',a} E_\pi (R_{t+1}+\gamma G_{t+1}|R_{t+1}=r, S_{t+1}=s', A=a, S_t=s)\,Pr(R_{t+1}=r, S_{t+1}=s', A=a|S_t=s) \\
&= \sum_{r,s',a} Pr(R_{t+1}=r, S_{t+1}=s', A=a|S_t=s)\,E_\pi (R_{t+1}+\gamma G_{t+1}|R_{t+1}=r, S_{t+1}=s', A=a, S_t=s) \\
&= \sum_{r,s',a} Pr(A=a|S_t=s)\,Pr(R_{t+1}=r, S_{t+1}=s'|A=a, S_t=s)\,E_\pi (r+\gamma G_{t+1}|S_{t+1}=s', A=a, S_t=s) \\
&= \sum_{r,s',a} \pi(a|s)\,p(s',r|s,a)\,E_\pi (r+\gamma G_{t+1}|S_{t+1}=s') \\
&= \sum_{r,s',a} \pi(a|s)\,p(s',r|s,a)\,[r+\gamma E_\pi (G_{t+1}|S_{t+1}=s')]
\end{align}

The key step is dropping the conditioning on $A=a$ and $S_t=s$ inside the expectation: once $R_{t+1}=r$ is known it can be pulled out as a constant, and by the Markov property, once $S_{t+1}=s'$ is known, $G_{t+1}$ does not depend on the earlier state and action. Since $E_\pi(G_{t+1}|S_{t+1}=s')=v_\pi(s')$, the final line is exactly the Bellman equation quoted from the book.
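If it helps to see this identity numerically, here is a minimal Python sketch on a made-up two-state, two-action MDP (the transition probabilities, rewards, policy, and discount below are invented for illustration; they are not from the book). It estimates $E_\pi(R_{t+1}+\gamma G_{t+1}|S_t=s)$ by Monte Carlo rollouts and compares it with the right-hand side $\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)[r+\gamma v_\pi(s')]$, where $v_\pi$ is obtained by solving the linear system $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# A hypothetical 2-state, 2-action MDP (illustrative numbers only).
# p[s][a] is a list of (prob, next_state, reward) triples, i.e. p(s', r | s, a).
p = {
    0: {0: [(0.7, 0, 1.0), (0.3, 1, 0.0)],
        1: [(1.0, 1, 2.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(0.5, 0, 5.0), (0.5, 1, -1.0)]},
}
# A fixed stochastic policy pi(a | s).
pi = {0: [0.6, 0.4], 1: [0.3, 0.7]}

def mc_value(s0, n_episodes=3000, horizon=100):
    """Monte Carlo estimate of E_pi[R_{t+1} + gamma*G_{t+1} | S_t = s0]."""
    returns = []
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):            # truncate: gamma**100 is negligible
            a = rng.choice(2, p=pi[s])
            probs, nxt, rew = zip(*p[s][a])
            i = rng.choice(len(probs), p=probs)
            g += discount * rew[i]
            discount *= gamma
            s = nxt[i]
        returns.append(g)
    return np.mean(returns)

def bellman_rhs(s, v):
    """Right-hand side: sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma*v(s')]."""
    return sum(pi[s][a] * sum(prob * (r + gamma * v[s2])
                              for prob, s2, r in p[s][a])
               for a in (0, 1))

# Solve v = r_pi + gamma * P_pi v exactly, then compare the three quantities.
P, r_pi = np.zeros((2, 2)), np.zeros(2)
for s in (0, 1):
    for a in (0, 1):
        for prob, s2, r in p[s][a]:
            P[s, s2] += pi[s][a] * prob
            r_pi[s] += pi[s][a] * prob * r
v = np.linalg.solve(np.eye(2) - gamma * P, r_pi)

for s in (0, 1):
    print(f"s={s}: v={v[s]:.3f}  Bellman RHS={bellman_rhs(s, v):.3f}  MC={mc_value(s):.3f}")
```

With enough episodes the Monte Carlo column matches the other two up to sampling noise, which is the law of total expectation above at work.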