The answer is tucked in the abstract (emphasis mine): "We prove that, given training under any $\epsilon$-soft policy [...] to the action-value function for an arbitrary target policy."
The authors also write, "We always use $\pi$ to denote the target policy, the policy that we are learning about."
Epsilon-soft is, as defined here, synonymous with epsilon-greedy:
An epsilon-soft policy chooses one action with probability $1 - \epsilon + \frac{\epsilon}{|A(s)|}$ and all other actions with probability $\frac{\epsilon}{|A(s)|}$, where $|A(s)|$ is the number of actions available in state $s$. This is equivalent to saying that the policy chooses a random action with some small probability epsilon, and a single (dominant) action otherwise.
Wiki defines epsilon-greedy similarly:
One such method is $\epsilon$-greedy, when the agent chooses the action that it believes has the best long-term effect with probability $1-\epsilon$, and it chooses an action uniformly at random, otherwise.
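As a concrete sketch of the definitions above (the `q_values` lookup table is a hypothetical stand-in for whatever value estimates the agent maintains), epsilon-greedy action selection can be written as:

```python
import random

def epsilon_greedy_action(q_values, actions, epsilon=0.1):
    """Pick the greedy action with probability 1 - epsilon, otherwise a
    uniform random action. Note the greedy action can also be drawn in the
    random branch, so its total probability is 1 - epsilon + epsilon/|A(s)|,
    matching the epsilon-soft definition quoted above."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[a])
```

The random branch includes the greedy action itself, which is exactly why the two definitions coincide.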
That is, there are no restrictions on the target policy $\pi$; it is the behaviour policy $b$ that must be epsilon-soft.
My question is, how should I define the value of the terminal state?
The state value of the terminal state in an episodic problem should always be zero. The value of a state is the expected sum of all future rewards when starting in that state and following a specific policy. For the terminal state, this is zero - there are no more rewards to be had.
So if I want to improve my policy by making it greedy with respect to the neighbouring states, the states next to the terminal state won't want to choose the terminal state (since there are non-terminal neighbouring states with positive value).
You have not made it 100% clear here, but I am concerned that you might be thinking the greedy policy is chosen like this: $\pi(s) = \text{argmax}_a [\sum_{s'} p(s'|s, a) v(s') ]$ - where $v(s)$ is your state value function, and $p(s'|s, a)$ is the probability of transition to state $s'$ given starting state $s$ and action $a$ (using the same notation as Sutton & Barto, 2nd edition). That is not the correct formula for the greedy action choice. Instead, in order to maximise reward from the next action, you take into account the immediate reward plus the expected future rewards from the next state (I have added in the commonly-used discount factor $\gamma$ here):
$$\pi(s) = \text{argmax}_a [\sum_{s',r} p(s',r|s, a)(r + \gamma v(s')) ]$$
If you are more used to seeing transition matrix $P_{ss'}^a$ and expected reward matrix $R_{ss'}^a$, then the same formula using those is:
$$\pi(s) = \text{argmax}_a [\sum_{s'} P_{ss'}^a( R_{ss'}^a + \gamma v(s')) ]$$
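A minimal sketch of this one-step lookahead, using the $P_{ss'}^a$ / $R_{ss'}^a$ form (the nested-dict MDP layout here is an assumption made for illustration, not something from the question):

```python
def greedy_action(s, P, R, v, gamma=1.0):
    """Greedy one-step lookahead:
    argmax_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * v[s']).
    P[s][a][s'] holds transition probabilities, R[s][a][s'] expected
    rewards, and v the current state-value estimates."""
    def backup(a):
        return sum(p * (R[s][a][s2] + gamma * v[s2])
                   for s2, p in P[s][a].items())
    return max(P[s], key=backup)
```

Because the immediate reward $r$ is inside the sum, an action that ends the episode with reward 1 competes on equal footing with actions that move to a state of value 1.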
When you use this greedy action choice, the action that transitions to the terminal state has at least equal value to the other choices.
In addition, your specific problem has another issue, related to how you have set the rewards.
I am working in an environment where each transition rewards 0 except for the transitions into the terminal state, which reward 1.
Does this sort of environment just not work with state-value dynamic programming methods of reinforcement learning? I don't see how I can make this work.
Recall that state values are the value of the state only assuming a specific policy. Answers to your problem are going to depend on the type of learning algorithm you use, and whether you allow stochastic or deterministic policies. There should always be at least one state with at least some small chance of transitioning to the terminal state, in order for the other states to have any value other than 0. That should be guaranteed under most learning algorithms. However, many of these algorithms could well learn convoluted policies which choose not to transition to the terminal state when you would expect/want them to (without knowing your problem definition, I could not say which that is intuitively).
Your biggest issue is that with your reward structure, you have given the agent no incentive to end the episode. Yes, it can get a reward of 1, but your reward scheme means that the agent is guaranteed to get that reward eventually whatever it does; there is no time constraint. If you applied a learning algorithm - e.g. Policy Iteration - to your MDP, you could find that all states except the terminal state have a value of 1, which the agent will get eventually once it transitions to the terminal state. As long as it learns a policy where that happens eventually, then as far as the agent is concerned, it has learned an optimal policy.
If you want to have an agent that solves your MDP in minimal time, in an episodic problem, it is usual to encode some negative reward for each time step. A basic maze solver for instance, typically gives reward -1 for each time step.
An alternative might be to apply a discount factor $0 \lt \gamma \lt 1$ - which will cause the agent to have some preference for immediate rewards, and this should impact the policy so that the step to the terminal state is always taken.
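To see the effect of the discount factor on your reward structure, here is a minimal value-iteration sketch (the nested-dict MDP layout and the chain example are assumptions for illustration, not from the question):

```python
def value_iteration(states, P, R, gamma=1.0, tol=1e-9):
    """Simple value iteration. P[s][a][s'] are transition probabilities,
    R[s][a][s'] expected rewards; terminal states have no actions and so
    keep value 0."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not P.get(s):          # terminal state: value stays 0
                continue
            best = max(sum(p * (R[s][a][s2] + gamma * v[s2])
                           for s2, p in P[s][a].items())
                       for a in P[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```

On a chain A → B → T with reward 1 only on entering the terminal state T, $\gamma = 1$ gives both A and B value 1 (no preference for finishing sooner), while $\gamma = 0.9$ gives B value 1 but A only 0.9, so states further from the terminal state are worth less and the agent prefers to end the episode.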
Best Answer
The article you are reading is not using terminology correctly, and the initial systems that it uses to demonstrate concepts are not MDPs, but other related systems with the Markov property.
More specifically, this:
does not describe a Markov Decision Process. There is no decision. An MDP is a Markov process in which decisions are made that affect the outcome. These decisions are usually framed as choosing from a set of allowed actions in the state.
Extending the notation from your quote, the following describes state transitions in a MDP:
$$P_{ss'}^a = \mathbb{P}\{S_{t+1} = s' | S_t = s, A_t = a\}$$
i.e. you can define a transition matrix between states $s \rightarrow s'$, per action choice $a$.
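One way to hold such a per-action transition structure in code (the two-state, two-action numbers are invented purely for illustration):

```python
# P[a][s][s'] = probability that S_{t+1} = s' given S_t = s, A_t = a,
# i.e. one transition matrix per action, as in the formula above.
P = {
    'left':  {'s0': {'s0': 1.0, 's1': 0.0},
              's1': {'s0': 0.9, 's1': 0.1}},
    'right': {'s0': {'s0': 0.2, 's1': 0.8},
              's1': {'s0': 0.0, 's1': 1.0}},
}

# Each row of each per-action matrix is a distribution over next states.
for a in P:
    for s in P[a]:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-9
```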
That is part of the definition of an MDP. Without an action choice, you don't have an MDP, but some other, possibly related process.
No. The policy defines the action choice, and is typically something that can be evaluated or modified within the context of an environment. The policy is an entirely separate probability table from the transition matrix, but will interact with it to create distributions of states and rewards when an agent following the policy acts in the environment.
Usually the state transition matrix represents the rules of the environment that cannot be changed, whilst the policy may be under your control. The policy could be optimised by making "best" action choices under some measure - usually a sum over expected future rewards. That control setting is not the only use of MDPs in reinforcement learning, but it is the main one.
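To make that interaction concrete: a stochastic policy $\pi(a|s)$ combines with the per-action transition matrices to induce a plain Markov chain over states, $P^{\pi}_{ss'} = \sum_a \pi(a|s) P_{ss'}^a$. A minimal sketch, assuming the same nested-dict layout as above (an invented two-state example):

```python
def induced_chain(policy, P):
    """Combine policy[s][a] with per-action transitions P[a][s][s'] into a
    plain Markov chain: P_pi[s][s'] = sum_a policy[s][a] * P[a][s][s']."""
    chain = {}
    for s, action_probs in policy.items():
        row = {}
        for a, pa in action_probs.items():
            for s2, p in P[a][s].items():
                row[s2] = row.get(s2, 0.0) + pa * p
        chain[s] = row
    return chain
```

Changing the policy changes the induced chain (and hence the state and reward distributions), while the per-action matrices themselves stay fixed; that is the sense in which the environment's rules are outside your control but the policy is not.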