The article you are reading is not using terminology correctly, and the initial systems that it uses to demonstrate concepts are not MDPs, but other related systems with the Markov property.
More specifically, this:
A state transition probability tells us, given that we are in state $s$, the probability that the next state will be $s'$.
$P_{ss'} = P[S_{t+1} = s' | S_t = s]$
does not describe a Markov Decision Process, because there is no decision. A MDP is a Markov process in which decisions are made that affect the outcome. These decisions are usually framed as choosing from a set of allowed actions in the current state.
Extending the notation from your quote, the following describes state transitions in a MDP:
$$P_{ss'}^a = \mathbb{P}\{S_{t+1} = s' | S_t = s, A_t = a\}$$
i.e. you can define a transition matrix between states $s \rightarrow s'$, per action choice $a$.
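To make this concrete, here is a minimal sketch of per-action transition matrices for a hypothetical 3-state, 2-action MDP (all states, actions and probabilities are invented purely for illustration):

```python
import numpy as np

# Hypothetical 3-state MDP with 2 actions.
n_states, n_actions = 3, 2

# P[a, s, s'] = probability of moving to s' when taking action a in s,
# i.e. one stochastic matrix per action choice.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0],   # action 0: mostly stay put
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]   # state 2 is absorbing
P[1] = [[0.1, 0.9, 0.0],   # action 1: mostly move right
        [0.0, 0.1, 0.9],
        [0.0, 0.0, 1.0]]

# Each P[a] is a valid stochastic matrix: its rows sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```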
My question is: why do we need the variables $A$ and $a$ to describe the action?
That is part of the definition of MDP. Without an action choice, you don't have a MDP, but some other, possibly related process.
Isn't the policy simply the state transition matrix?
No. The policy defines the action choice, and is typically something that can be evaluated or modified within the context of an environment. The policy is an entirely separate probability table from the transition matrix, but it interacts with it to create distributions of states and rewards when an agent following the policy acts in the environment.
Usually the state transition matrix represents the rules of the environment that cannot be changed, whilst the policy may be under your control. The policy could be optimised by making "best" action choices under some measure - usually a sum over expected future rewards. That control setting is not the only use of MDPs in reinforcement learning, but it is the main one.
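As a sketch of that interaction: a stochastic policy can be stored as its own table $\pi(a|s)$, and fixing the policy collapses the MDP into an ordinary Markov chain over states via $P^{\pi}_{ss'} = \sum_a \pi(a|s) P_{ss'}^a$. The toy numbers below are invented for illustration:

```python
import numpy as np

# Same toy per-action transition matrices as before (hypothetical).
P = np.zeros((2, 3, 3))
P[0] = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]

# The policy is a separate table: pi[s, a] = probability of taking
# action a in state s. It can be changed without touching P.
pi = np.array([[0.5, 0.5],   # in state 0, pick each action 50/50
               [0.0, 1.0],   # in state 1, always take action 1
               [1.0, 0.0]])  # in state 2 (absorbing), action is moot

# Combining the two gives the state-to-state chain under this policy:
# P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)
P_pi = np.einsum('sa,ast->st', pi, P)
```

Changing `pi` changes `P_pi` (and hence the distribution of visited states), while `P` itself stays fixed, which is exactly the "rules of the environment vs. things under your control" split described above.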
Best Answer
The state value of the terminal state in an episodic problem should always be zero. The value of a state is the expected sum of all future rewards when starting in that state and following a specific policy. For the terminal state, this is zero - there are no more rewards to be had.
You have not made it 100% clear here, but I am concerned that you might be thinking the greedy policy is chosen like this: $\pi(s) = \text{argmax}_a [\sum_{s'} p(s'|s, a) v(s') ]$, where $v(s)$ is your state value function and $p(s'|s, a)$ is the probability of transitioning to state $s'$ given starting state $s$ and action $a$ (using the same notation as Sutton & Barto, 2nd edition). That is not the correct formula for the greedy action choice. Instead, in order to maximise reward from the next action, you take into account the immediate reward plus the expected future rewards from the next state (I have added the commonly-used discount factor $\gamma$ here):
$$\pi(s) = \text{argmax}_a [\sum_{s',r} p(s',r|s, a)(r + \gamma v(s')) ]$$
If you are more used to seeing transition matrix $P_{ss'}^a$ and expected reward matrix $R_{ss'}^a$, then the same formula using those is:
$$\pi(s) = \text{argmax}_a [\sum_{s'} P_{ss'}^a( R_{ss'}^a + \gamma v(s')) ]$$
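Here is a short sketch of that greedy step in code, using the $P_{ss'}^a$ / $R_{ss'}^a$ form of the formula. The matrices are invented toy values: reward 1 for entering the terminal state 2, and an all-zero initial value estimate $v$:

```python
import numpy as np

gamma = 0.9

# Hypothetical per-action transition matrices, P[a, s, s'].
P = np.zeros((2, 3, 3))
P[0] = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]

# R[a, s, s'] plays the role of R_{ss'}^a: expected immediate reward
# for the transition s -> s' under action a. Here, reward 1 for
# entering the terminal state 2 from elsewhere.
R = np.zeros((2, 3, 3))
R[:, :, 2] = 1.0
R[:, 2, 2] = 0.0   # no further reward once already terminal

v = np.zeros(3)    # e.g. an all-zero value estimate

# q[s, a] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * v[s'])
q = np.einsum('ast,ast->sa', P, R + gamma * v)
greedy = q.argmax(axis=1)   # greedy action in each state
```

With these numbers the greedy choice in state 1 is action 1, the action most likely to reach the terminal state, because the immediate reward term $r$ is included even though $v$ is all zeros.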
When you use this greedy action choice, the action that transitions to the terminal state has at least equal value to the other choices.
In addition, your specific problem has another issue, related to how you have set the rewards.
Recall that state values are the values of states only assuming a specific policy. Answers to your problem are going to depend on the type of learning algorithm you use, and on whether you allow stochastic or deterministic policies. For any state to have a value other than 0, at least one state must have some small chance of transitioning to the terminal state; that should be guaranteed under most learning algorithms. However, many of these algorithms could well learn convoluted policies which choose not to transition to the terminal state when you would expect or want them to (without knowing your problem definition, I could not say which that would be intuitively).
Your biggest issue is that, with your reward structure, you have given the agent no incentive to end the episode. Yes, it can get a reward of 1, but your reward scheme means the agent is guaranteed to get that reward eventually whatever it does; there is no time constraint. If you applied a learning algorithm - e.g. Policy Iteration - to your MDP, you could find that all states except the terminal state have a value of 1, which the agent will collect eventually once it transitions to the terminal state. As long as it learns a policy where that happens eventually, then as far as the agent is concerned, it has learned an optimal policy.
If you want to have an agent that solves your MDP in minimal time, in an episodic problem, it is usual to encode some negative reward for each time step. A basic maze solver for instance, typically gives reward -1 for each time step.
An alternative might be to apply a discount factor $0 \lt \gamma \lt 1$, which will cause the agent to prefer immediate rewards; this should impact the policy so that the step into the terminal state is taken as soon as possible.
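A quick numeric illustration of that effect, assuming (hypothetically) that the only reward is 1 on entering the terminal state:

```python
# If the only reward is 1 on entering the terminal state, the return
# from a state that is k steps away along a deterministic path is
# gamma ** k. With gamma < 1, shorter paths are strictly better.
gamma = 0.9
returns = {k: gamma ** k for k in (1, 5, 20)}
# With gamma = 1, every such path would be worth exactly 1, and the
# agent would be indifferent to how long the episode lasts.
```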