Solved – Reinforcement Learning – difference between a Policy and a State transition matrix

reinforcement learning

https://towardsdatascience.com/getting-started-with-markov-decision-processes-reinforcement-learning-ada7b4572ffb

A state transition probability tells us, given that we are in state $s$, the probability that the next state will be $s'$.

$P_{ss'} = P[S_{t+1} = s' | S_t = s]$

We can also define all state transitions in terms of a state transition matrix where each row now tells us the transition probabilities from one state to all possible successor states.

$P = \begin{bmatrix}
P_{11} & P_{12} & \dots \\
\vdots & \ddots & \\
P_{K1} & & P_{KK}
\end{bmatrix}$
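
To make this concrete, here is a small sketch (a hypothetical 3-state chain with made-up probabilities) of such a matrix in NumPy, where each row sums to 1:

```python
import numpy as np

# Hypothetical 3-state Markov chain (made-up numbers).
# Row s is the distribution over the next state s'; each row sums to 1.
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
])

rng = np.random.default_rng(0)

def step(s):
    """Sample the next state s' according to the row P[s]."""
    return rng.choice(len(P), p=P[s])
```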

A policy is a distribution over actions given states. Policies give the mappings from one state to the next.

$\pi(a|s) = P[A_t = a | S_t = s]$
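
For comparison, a policy can also be written as a table; here is a made-up example over 3 states and 2 actions, indexed by state and action:

```python
import numpy as np

# Hypothetical policy pi(a|s) over 3 states and 2 actions (made-up numbers).
# Row s is a distribution over actions; each row sums to 1.
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

rng = np.random.default_rng(1)

def act(s):
    """Sample an action a according to pi(.|s)."""
    return rng.choice(pi.shape[1], p=pi[s])
```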

My question is: why do we need the variables $A_t$ and $a$ to describe the action? Isn't the policy simply the state transition matrix? Why can't the policy simply be written as

$\pi(a|s) = P[S_{t+1} = s' | S_t = s]$

Best Answer

The article you are reading is not using terminology correctly, and the initial systems that it uses to demonstrate concepts are not MDPs, but other related systems with the Markov property.

More specifically, this:

A state transition probability tells us, given that we are in state $s$, the probability that the next state will be $s'$.

$P_{ss'} = P[S_{t+1} = s' | S_t = s]$

does not describe a Markov Decision Process. There is no decision. An MDP is a Markov process in which decisions are made that affect the outcome. These decisions are usually framed as choosing from a set of allowed actions in the current state.

Extending the notation from your quote, the following describes state transitions in an MDP:

$$P_{ss'}^a = \mathbb{P}\{S_{t+1} = s' | S_t = s, A_t = a\}$$

i.e. you can define a separate transition matrix between states $s \rightarrow s'$ for each action choice $a$.
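
As a minimal sketch of that, with made-up numbers (2 actions, 3 states), you can store one transition matrix per action in a single array indexed as `P[a, s, s']`:

```python
import numpy as np

# Made-up MDP dynamics for 2 actions and 3 states.
# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    # action 0
    [[0.8, 0.2, 0.0],
     [0.0, 0.9, 0.1],
     [0.5, 0.0, 0.5]],
    # action 1
    [[0.1, 0.1, 0.8],
     [0.3, 0.3, 0.4],
     [0.0, 0.0, 1.0]],
])

# Each (action, state) row must be a valid probability distribution.
assert np.allclose(P.sum(axis=-1), 1.0)
```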

My question is: why do we need the variables $A_t$ and $a$ to describe the action?

That is part of the definition of an MDP. Without an action choice, you don't have an MDP, but some other, possibly related process.

Isn't the policy simply the state transition matrix?

No. The policy defines the action choice, and is typically something that can be evaluated or modified within the context of an environment. The policy is an entirely separate probability table from the transition matrix, but it interacts with the transition matrix to create distributions over states and rewards when an agent following the policy acts in the environment.
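
As a rough sketch of that interaction, using the same kind of made-up `P[a, s, s']` array as above and an invented policy table `pi[s, a]`, the state transitions the agent actually experiences under the policy are the policy-weighted average of the per-action matrices:

```python
import numpy as np

# Same made-up dynamics P[a, s, s'] as above, plus an invented policy pi[s, a].
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],  # action 0
    [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4], [0.0, 0.0, 1.0]],  # action 1
])
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

# Transition matrix of the Markov chain induced by following pi:
#   P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
P_pi = np.einsum('sa,asn->sn', pi, P)

# Each row of P_pi is still a probability distribution over next states.
assert np.allclose(P_pi.sum(axis=-1), 1.0)
```

`P_pi` is exactly the kind of matrix $P[S_{t+1} = s' | S_t = s]$ that the question proposes, but it is a derived quantity: change `pi` and `P_pi` changes too, while `P` itself stays fixed.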

Usually the state transition matrix represents the rules of the environment that cannot be changed, whilst the policy may be under your control. The policy could be optimised by making "best" action choices under some measure, usually a sum over expected future rewards. That control setting is not the only use of MDPs in reinforcement learning, but it is the main one.
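
To illustrate that last point, here is a minimal value-iteration sketch (a standard textbook recipe, not from the article) over the same kind of made-up dynamics, with an invented reward table `R[s, a]`:

```python
import numpy as np

# Made-up dynamics P[a, s, s'] and invented expected rewards R[s, a].
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.9, 0.1], [0.5, 0.0, 0.5]],  # action 0
    [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
    [0.0, 2.0],
])
gamma = 0.9  # discount factor

# Value iteration: V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
V = np.zeros(3)
for _ in range(500):
    Q = R + gamma * np.einsum('asn,n->sa', P, V)
    V = Q.max(axis=1)

greedy_policy = Q.argmax(axis=1)  # "best" action in each state under this measure
```

The greedy policy extracted at the end is just the deterministic special case of $\pi(a|s)$: it puts probability 1 on the argmax action in each state.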
