Solved – The value of a terminal state in reinforcement learning

reinforcement learning

I am working in an environment where every transition gives a reward of 0, except for transitions into the terminal state, which give a reward of 1.

My question is, how should I define the value of the terminal state? Right now I initialize all values to 0, but since the terminal state has no outgoing transitions its value stays 0 while the states around it gain positive value. So if I want to improve my policy by making it greedy with respect to the values of neighboring states, the states next to the terminal state will never choose to move into it (since they have non-terminal neighbors with positive value).

Does this sort of environment just not work with state-value dynamic programming methods of reinforcement learning? I don't see how I can make this work.

Best Answer

My question is, how should I define the value of the terminal state?

The state value of the terminal state in an episodic problem should always be zero. The value of a state is the expected sum of all future rewards when starting in that state and following a specific policy. For the terminal state, this is zero - there are no more rewards to be had.
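Written out using the standard episodic return (nothing here beyond the definition just given), this is:

$$v_\pi(s_{\text{terminal}}) = \mathbb{E}_\pi\left[\, R_{t+1} + R_{t+2} + R_{t+3} + \dots \mid S_t = s_{\text{terminal}} \,\right] = 0$$

because no rewards are received after termination, so the sum is empty.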

So if I want to improve my policy by making it greedy with respect to the values of neighboring states, the states next to the terminal state will never choose to move into it (since they have non-terminal neighbors with positive value)

You have not made it 100% clear here, but I am concerned that you might be thinking the greedy policy is chosen like this: $\pi(s) = \text{argmax}_a [\sum_{s'} p(s'|s, a) v(s') ]$, where $v(s)$ is your state value function and $p(s'|s, a)$ is the probability of transitioning to state $s'$ given starting state $s$ and action $a$ (using the same notation as Sutton & Barto, 2nd edition). That is not the correct formula for the greedy action choice. Instead, to maximise reward from the next action you take into account the immediate reward plus the expected future rewards from the next state (I have added the commonly-used discount factor $\gamma$ here):

$$\pi(s) = \text{argmax}_a [\sum_{s',r} p(s',r|s, a)(r + \gamma v(s')) ]$$

If you are more used to seeing transition matrix $P_{ss'}^a$ and expected reward matrix $R_{ss'}^a$, then the same formula using those is:

$$\pi(s) = \text{argmax}_a [\sum_{s'} P_{ss'}^a( R_{ss'}^a + \gamma v(s')) ]$$

When you use this greedy action choice, the action that transitions into the terminal state is at least as valuable as the other choices.
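To make that concrete, here is a minimal sketch in Python of the greedy choice on a made-up three-state chain (the state names, transition table and value estimates are purely illustrative assumptions, not part of your problem):

```python
# Minimal sketch (illustrative only): greedy action choice on a made-up
# 3-state chain  s0 -> s1 -> terminal, with reward +1 on the transition
# into the terminal state and 0 everywhere else. Dynamics are deterministic
# purely to keep the example short.

gamma = 1.0

# State values as they might look after policy evaluation:
# the terminal state is fixed at 0, its neighbours have positive value.
v = {"s0": 1.0, "s1": 1.0, "terminal": 0.0}

# (next_state, reward) for each (state, action) pair.
transitions = {
    ("s1", "enter_terminal"): ("terminal", 1.0),
    ("s1", "go_back"):        ("s0", 0.0),
    ("s0", "forward"):        ("s1", 0.0),
    ("s0", "stay"):           ("s0", 0.0),
}

def greedy_action(s, actions):
    """argmax_a [ r + gamma * v(s') ]  --  not  argmax_a [ v(s') ]."""
    def action_value(a):
        s_next, r = transitions[(s, a)]
        return r + gamma * v[s_next]
    return max(actions, key=action_value)

# From s1: entering the terminal state scores 1 + 1.0*0 = 1, while going
# back scores 0 + 1.0*1 = 1 -- the terminal move is not penalised for the
# terminal state's zero value.
print(greedy_action("s1", ["enter_terminal", "go_back"]))
```

Under the incorrect formula the comparison from $s_1$ would instead be $v(\text{terminal}) = 0$ against $v(s_0) = 1$, and the agent would avoid terminating.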

In addition, your specific problem has another issue, related to how you have set the rewards.

I am working in an environment where every transition gives a reward of 0, except for transitions into the terminal state, which give a reward of 1.

Does this sort of environment just not work with state-value dynamic programming methods of reinforcement learning? I don't see how I can make this work.

Recall that state values are only defined with respect to a specific policy. The answer to your problem depends on the type of learning algorithm you use, and on whether you allow stochastic or deterministic policies. For any of the other states to have a value other than 0, at least one state must have some small chance of transitioning into the terminal state under the policy; that should be guaranteed under most learning algorithms. However, many of these algorithms could well learn convoluted policies that choose not to transition into the terminal state when you would expect or want them to (without knowing your problem definition, I could not say what the intuitively right behaviour is).

Your biggest issue is that with your reward structure, you have given the agent no incentive to end the episode. Yes, it can get a reward of 1, but under your reward scheme the agent is guaranteed to get that reward eventually whatever it does; there is no time constraint. If you applied a learning algorithm - e.g. Policy Iteration - to your MDP, you could find that all states except the terminal state have a value of 1, which the agent will collect eventually once it transitions into the terminal state. As long as it learns a policy where that happens eventually, then as far as the agent is concerned, it has learned an optimal policy.
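You can see this on a toy example. The sketch below (a hypothetical four-state corridor, not your actual environment) runs value iteration with your reward scheme and $\gamma = 1$; every non-terminal state ends up with a value of exactly 1, so nothing in the values tells the agent to finish quickly:

```python
# Sketch (hypothetical 4-state corridor, not the asker's actual MDP):
#   [0] - [1] - [2] - [3], where state 3 is terminal.
# Reward is 0 on every transition except +1 on the transition into state 3,
# matching the reward scheme in the question, and gamma = 1 (no discount).

N_STATES = 4
TERMINAL = 3
GAMMA = 1.0

def step(s, a):
    """Deterministic dynamics: a is -1 (left) or +1 (right), clipped to the corridor."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward

v = [0.0] * N_STATES  # the terminal state's value stays 0 throughout

for _ in range(50):                    # value iteration sweeps
    for s in range(N_STATES - 1):      # skip the terminal state
        v[s] = max(r + GAMMA * v[s_next]
                   for s_next, r in (step(s, a) for a in (-1, +1)))

print(v)  # -> [1.0, 1.0, 1.0, 0.0]: every non-terminal state is worth exactly 1,
          #    so the values alone give no reason to head for the exit quickly.
```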

If you want an agent that solves your MDP in minimal time, it is usual in an episodic problem to encode some negative reward for each time step. A basic maze solver, for instance, typically gives a reward of -1 for each time step.

An alternative might be to apply a discount factor $0 \lt \gamma \lt 1$, which gives the agent a preference for immediate rewards; this should change the policy so that the step into the terminal state is always taken.
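Both fixes are easy to check on the same hypothetical corridor as in the sketch above: with a per-step reward of $-1$, or with the original reward and $\gamma = 0.9$, the values now grow as you approach the terminal state, so the greedy policy over $r + \gamma v(s')$ heads straight for it.

```python
# Sketch: the same hypothetical 4-state corridor, with one of the two fixes applied.
# Option A: a reward of -1 on every time step (and nothing else), gamma = 1.
# Option B: the original +1-on-termination reward, but with gamma = 0.9.

N_STATES, TERMINAL = 4, 3

def value_iteration(reward_fn, gamma, sweeps=100):
    v = [0.0] * N_STATES               # terminal value stays 0
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # skip the terminal state
            v[s] = max(
                reward_fn(s, s_next) + gamma * v[s_next]
                for s_next in (min(max(s + a, 0), N_STATES - 1) for a in (-1, +1))
            )
    return v

def step_cost(s, s_next):
    """Option A: every step costs 1; no bonus for finishing."""
    return -1.0

def goal_reward(s, s_next):
    """Option B: the original scheme, +1 only on entering the terminal state."""
    return 1.0 if s_next == TERMINAL else 0.0

print(value_iteration(step_cost, gamma=1.0))    # -> [-3.0, -2.0, -1.0, 0.0]
print(value_iteration(goal_reward, gamma=0.9))  # -> approx [0.81, 0.9, 1.0, 0.0]

# In both cases the value increases the closer you are to the terminal state,
# so a greedy policy over r + gamma * v(s') always moves toward it.
```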
