Solved – Value Iteration For Terminal States in MDP

reinforcement learning

I am a little confused about the value iteration algorithm.
When I loop over states, should I visit the terminal states, if any, or not?

In Sutton's book, on page 83, the pseudocode says to loop over each state in $S$, not $S^+$, i.e. the terminal states are skipped, but every other reference I have seen makes no distinction between terminal and non-terminal states.

Sutton's book:
https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view

Best Answer

The point of visiting a state in value iteration is to update its value, using the update:

$$v(s) \leftarrow \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v(s')\bigr]$$
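As a minimal sketch of this single-state backup in Python, assuming a hypothetical model `P[s][a]` that maps to a list of `(probability, next_state, reward)` tuples (these names are illustrative, not from the book):

```python
def bellman_optimality_update(V, P, s, gamma):
    """Return max over actions of sum_{s',r} p(s',r|s,a) * (r + gamma * V[s'])."""
    return max(
        sum(prob * (reward + gamma * V[s_next])
            for prob, s_next, reward in P[s][a])
        for a in P[s]
    )
```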

The first thing to note is that the value of a terminal state $s^T$ is always $v(s^T) = 0$, since by definition there are no future rewards to accumulate. Any calculation that found a possible reward, or a different next state, after a terminal state and updated its value to be non-zero would not be valid.

You can define things so that it is valid to run the update. If you implement terminal states as "absorbing states", then $p(s^T, 0 \mid s^T, a) = 1$ for every action $a$, and the probability of any other (state, reward) pair is zero, so running the update above just rewrites $0$ as $0$.
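For example, a self-contained sketch of a hypothetical absorbing terminal state (state and action names are made up for illustration) shows the backup leaving the value at zero:

```python
gamma = 0.9
V = {"T": 0.0}                                          # terminal state value is 0
P_terminal = {a: [(1.0, "T", 0.0)] for a in ["left", "right"]}  # p(T, 0 | T, a) = 1

# Backup for the terminal state: every action loops back with reward 0,
# so the result is max_a [1.0 * (0 + gamma * 0)] = 0.
backup = max(
    sum(prob * (reward + gamma * V[s_next])
        for prob, s_next, reward in P_terminal[a])
    for a in P_terminal
)
assert backup == 0.0   # the update rewrites 0 as 0
```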

In general there is no point in updating the value of a terminal state. With correct definitions of the transition and reward functions there is no harm in doing so, but it is wasted computation.
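Putting this together, a value iteration loop that simply skips terminal states might look like the following sketch, where `P` (a dict of `P[s][a]` lists of `(prob, next_state, reward)` tuples) and `terminal_states` are hypothetical inputs:

```python
def value_iteration(P, terminal_states, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in P}                   # terminal values stay fixed at 0
    while True:
        delta = 0.0
        for s in P:
            if s in terminal_states:          # loop over S, not S+: skip the update
                continue
            v_new = max(
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Tiny illustrative example: one non-terminal state "A" that moves to terminal "T"
# for reward +1; "T" is modelled as absorbing but is never updated anyway.
P = {
    "A": {"go": [(1.0, "T", 1.0)]},
    "T": {"go": [(1.0, "T", 0.0)]},
}
print(value_iteration(P, terminal_states={"T"}))   # {'A': 1.0, 'T': 0.0}
```

Whether you skip terminal states or include them as absorbing states, the resulting value function is the same; skipping them just avoids the wasted backups.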