Solved – Reinforcement Learning Value Iteration Explained

Tags: reinforcement-learning, self-study

I am having a hard time understanding the value iteration derived from the Bellman equation:

$$V_{k=0}(s) = 0$$

$$\forall s : V_{k+1}(s) = \max_a \Bigl[ R(s,a) + \gamma \sum_{s'} P(s'|s, \pi(s)) V_k(s') \Bigr]$$
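
In code, I picture one sweep of this update roughly like this (my own rough sketch; the container names `R`, `P`, `actions` are just my guesses at how to organise the quantities):

    # One sweep of value iteration:
    # V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) * V_k(s') ]
    def value_iteration_sweep(V, states, actions, R, P, gamma):
        """Compute V_{k+1} for every state from the previous estimate V_k."""
        V_new = {}
        for s in states:
            V_new[s] = max(
                R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
                for a in actions[s]          # actions available in state s
            )
        return V_new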

I am trying to understand this with a concrete example I have thought of: a maze with a rat in it:

                            +------------------+
                            | s2             s3|
                            +------+ s1 +------+
                                   |    |
                                   |  R |

There are only three states. $s_1$ is a T-junction with the possible actions left and right. $s_2$ and $s_3$ are food and something else, poison or so, thus granting rewards $(2, 0)$. Everything is deterministic, and I would like to use a discount factor of $\gamma = 0.5$.

What would a concrete calculation look like?

Here are my thoughts:

  • $R(s,a)$ is the reward that is granted immediately (isn't it?), so I guess for $s_1$ this would be $0$?
  • the sum $\sum_{s'}$ runs over the possible successor states, which here would be $s_2$ and $s_3$?
  • the transition probability $P(s'|s, \pi(s))$ is not given, but maybe it is just $1/2$?

When I try to write it down, it does not seem to make a lot of sense:

\begin{align}
V_{k=1}(s_1) = 0 + 0.5 \cdot (0.5 \cdot 0 + 0.5 \cdot 0) = 0
\end{align}

This would just become zero again, is that correct?

\begin{align}
V_{k=1}(s_2) = 2 + 0.5 \cdot (0.5 \cdot 0 + 0.5 \cdot 0) = 2
\end{align}

How would the $a$ be evaluated? It is the action the rat chooses, and it should pick the one with the highest result, but what are the possible $a$?

Best Answer

Your example is a little weaker than it could be, because there don't really seem to be any actions; however, we can work with that. I'll preserve the $\max_a$ and $R(s,a)$ pieces of notation, although $a$ isn't operative in the example. I'm assuming that from state 1 there is a 50-50 chance of transitioning to either state 2 or state 3, and that the discount rate is also 50%, as per your example. I also assume the rewards $R(s,a)$ for being in states 1, 2, and 3 are 0, 2, and 0 respectively.
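
To make those assumptions concrete, here they are as small Python tables (the state labels and the table layout are my own encoding, not something from your post):

    gamma = 0.5
    states = [1, 2, 3]
    R = {1: 0.0, 2: 2.0, 3: 0.0}             # reward for being in each state
    P = {1: {2: 0.5, 3: 0.5}, 2: {}, 3: {}}  # P(s'|s); states 2 and 3 are terminal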

Value iteration iterates until the values converge. At initialization (iteration 0), as you have, $V_0(s) = 0 \; \forall s$.

At iteration 1:

$V_1(s=3) = \max_a\{R(3,a)\} = 0$ as state 3 is a terminal state, so no transitions.

$V_1(s=2) = \max_a\{R(2,a)\} = 2$ as state 2 is a terminal state, so no transitions.

$V_1(s=1) = \max_a\{R(1,a) + 0.5(0.5 \cdot V_0(s=2) + 0.5 \cdot V_0(s=3))\} = 0$

At iteration 2:

$V_2(s=3) = \max_a\{R(3,a)\} = 0$ as state 3 is a terminal state, so no transitions.

$V_2(s=2) = \max_a\{R(2,a)\} = 2$ as state 2 is a terminal state, so no transitions.

$V_2(s=1) = \max_a\{R(1,a) + 0.5(0.5 \cdot V_1(s=2) + 0.5 \cdot V_1(s=3))\} = 0.5$

At iteration 3:

$V_3(s=3) = \max_a\{R(3,a)\} = 0$ as state 3 is a terminal state, so no transitions.

$V_3(s=2) = \max_a\{R(2,a)\} = 2$ as state 2 is a terminal state, so no transitions.

$V_3(s=1) = \max_a\{R(1,a) + 0.5(0.5 \cdot V_2(s=2) + 0.5 \cdot V_2(s=3))\} = 0.5$

And we have convergence! $V_3(s) = V_2(s)$ for all $s$. In a more complex example with actions included, $V_3(s) = V_2(s)$ implies that the actions selected at iteration 3 and at iteration 2 are either equal or equivalent in terms of value; if the algorithm also includes a fixed rule for breaking ties between actions with equal expected value, the implication strengthens to the actions being identical between iterations.
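
If you want to check the arithmetic numerically, here is a minimal sketch of the same computation in Python (again with my own encoding of your maze; since the actions aren't operative, the $\max_a$ is dropped):

    gamma = 0.5
    states = [1, 2, 3]
    R = {1: 0.0, 2: 2.0, 3: 0.0}              # rewards for states 1, 2, 3
    P = {1: {2: 0.5, 3: 0.5}, 2: {}, 3: {}}   # terminal states have no transitions

    V = {s: 0.0 for s in states}              # V_0(s) = 0 for all s
    for k in range(1, 4):
        V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
             for s in states}
        print(k, V)

    # Output:
    # 1 {1: 0.0, 2: 2.0, 3: 0.0}
    # 2 {1: 0.5, 2: 2.0, 3: 0.0}
    # 3 {1: 0.5, 2: 2.0, 3: 0.0}   <- V_3 = V_2, so the values have converged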
