In reinforcement learning, what is the formal definition of the symbols $S_t$ and $A_t$?


In both the reinforcement learning course given by David Silver and the latest draft of Richard S. Sutton's RL book, what is the formal definition of the symbols $S_t$ and $A_t$?

Do these definitions depend on the policy being used until time-step $t$?


Some context for the question:

I realize that this question might seem trivial, since the authors explicitly define these variables in their textbook/lecture. However, I'm currently trying to make sense of their definitions, but no interpretation I think of seems to yield a coherent/consistent notation. What follows is my line of thought, showing which interpretations I tried and why they seem to be inconsistent.

These authors seem to define the following symbols like so:

$$S_t\triangleq\text{The state we visit at time-step }t.$$
$$A_t\triangleq\text{The action we take at time-step }t.$$

Where both $S_t$ and $A_t$ are commonly treated as random variables.

However, what confuses me is that in order to properly define $S_t$ for $t>0$ it seems necessary to first define all the $S_i,A_i$, for $0\leq i <t$. Hence, it seems that the definition of $S_t$ only makes sense if we specify with which policy we're choosing our actions at all the time-steps before we reached time-step $t$.

This is already ambiguous, and personally confusing, since the symbol $S_t$ makes no mention whatsoever of which policy was used to sample the actions up to that point. For example, the same symbol $S_5$ can represent completely different random variables if it appears in different contexts with different policies being used (or different starting states).
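
To make the ambiguity concrete, here is a minimal sketch (a made-up three-state chain MDP and two made-up policies, nothing from the book or the lectures) where the empirical distribution of $S_5$ clearly depends on which policy generated the actions:

```python
from collections import Counter
import random

# Hypothetical 3-state chain MDP: states 0, 1, 2; actions "left" / "right".
# Transitions are deterministic here purely to keep the sketch short.
def step(state, action):
    return min(state + 1, 2) if action == "right" else max(state - 1, 0)

# Two different stochastic policies over the same MDP.
def policy_a(state):
    return "right" if random.random() < 0.9 else "left"

def policy_b(state):
    return "left" if random.random() < 0.9 else "right"

def distribution_of_S_t(policy, t, s0=0, episodes=10_000):
    """Empirical distribution of S_t when all actions are drawn from `policy`."""
    counts = Counter()
    for _ in range(episodes):
        state = s0
        for _ in range(t):
            state = step(state, policy(state))
        counts[state] += 1
    return {s: n / episodes for s, n in sorted(counts.items())}

print("S_5 under policy A:", distribution_of_S_t(policy_a, t=5))
print("S_5 under policy B:", distribution_of_S_t(policy_b, t=5))
# The two distributions differ, so the symbol S_5 on its own does not pin down
# a unique random variable without saying which policy generated the actions.
```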

This ambiguity didn't strike me as a major problem, since I thought one could always use, say, a superscript to indicate the policy being used at all the previous time-steps, e.g. $S_5^\pi$. Also, for most of the discussions it was very clear which policy was being used (and almost as clear what the starting state was), so it seemed harmless to drop the extra notation.

However, I later encountered definitions like, for instance, the action-value function of a state action pair:
$$ Q_\pi(s,a)=\mathbb E_\pi[G_t | S_t=s, A_t=a] $$

This is supposed to be defined for all legal actions $a$ in state $s$. However, if I interpret $S_t$ as $S_t^\pi$ and $A_t$ as $A_t^\pi$, respectively, this definition seems to break down when $a$ is an action that policy $\pi$ would never choose, or $s$ is an unreachable state given policy $\pi$ and some starting state $s_0$ (since we'll be conditioning the expectation on an impossible event). So it seems that these authors are not simply dropping the superscript I mentioned before, and instead $S_t$ and $A_t$ have some other definition.

Best Answer

In the summary of notation (page xvi) of Sutton and Barto's book, they define $S_t$ as:

state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$

This is similar to what you observed:

However, what confuses me is that in order to properly define $S_t$ for $t>0$ it seems necessary to first define all the $S_i, A_i$, for $0 \leq i < t$.

The main difference between the book's definition and your observation is that they only take $i = t-1$, not $0 \leq i < t$. That single prior step is sufficient because of the Markov property, which is assumed to hold throughout the book; we're almost always talking about Markov Decision Processes.
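
Written out (this is the standard statement of the Markov property, not a quote from the book), the distribution of the current state depends on the history only through the most recent state and action:

$$\Pr\{S_t = s' \mid S_{t-1}, A_{t-1}, S_{t-2}, A_{t-2}, \dots, S_0, A_0\} = \Pr\{S_t = s' \mid S_{t-1}, A_{t-1}\}.$$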

Another difference is that $S_t$ does not strictly require $S_{t-1}$ and $A_{t-1}$ to be defined; those values just typically explain how we ended up where we are now ($S_t$). The obvious exception is the initial state $S_0$, which we simply start out in, more or less out of the blue.

Hence, it seems that the definition of $S_t$ only makes sense if we specify with which policy we're choosing our actions at all the time-steps before we reached time-step $t$.

This is not necessary. An agent could, in theory, even change its policy during an episode. In fact, as a random variable, $S_t$ doesn't really represent just a single value at all. It's a symbol that we use to denote the state we happen to be in during some episode at time $t$, without caring about what we did before that or plan to do after it. See the Wikipedia page on random variables.
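
As an illustration, here is a minimal sketch (a hypothetical random-walk environment and two hypothetical policies, not anything from the book) in which the behaviour policy changes halfway through the episode; $S_7$ is simply whatever state the resulting trajectory contains at index 7:

```python
import random

# Hypothetical 1-D random-walk environment: the state is an integer,
# the action is a step of -1 or +1.
def step(state, action):
    return state + action

def exploratory_policy(state):
    return random.choice([-1, +1])   # wander in both directions

def greedy_policy(state):
    return +1                        # always move right

def rollout(horizon=10, s0=0):
    """Generate one episode in which the agent switches policies halfway through."""
    states = [s0]
    for t in range(horizon):
        policy = exploratory_policy if t < horizon // 2 else greedy_policy
        action = policy(states[-1])
        states.append(step(states[-1], action))
    return states

trajectory = rollout()
# S_t is just the state at index t of whichever trajectory actually happened;
# its definition doesn't care that two different policies produced the actions.
print("S_7 in this episode:", trajectory[7])
```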


$$ Q_\pi(s,a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] $$

This equation simply says that the value of $Q_\pi$ is equal to the returns that we expect to obtain if:

  1. we start following policy $\pi$ from now on (it doesn't matter which policy we've been following up until now)
  2. we happen to currently be in state $s$ (a specific value, no longer a random variable), and we happen to have selected action $a$.

Note how the definition does not depend directly on which policy we've been using in the past. It depends on the past policy only indirectly, in the sense that it explains how we may have ended up in state $S_t = s$. But we do not need to know that past policy to properly define anything in this equation, and the past policy will often not even be enough to completely explain how we ended up where we happen to be now: in stochastic environments, for example, we would also need to know the random seed of whatever random number generator drove the transitions. The definition does not rely on our ability to explain any of this. We simply take for granted that, at time $t$, we are in state $S_t = s$ and have selected $A_t = a$, and the equation is well-defined from there.

This equation happens to rely on the future policy $\pi$, but that may be a different policy from our past policy, and this reliance is denoted by the subscript on $Q_\pi$ and $\mathbb{E}_\pi$.
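
One operational way to see this is a Monte Carlo estimate of $Q_\pi(s,a)$. The sketch below uses a made-up toy MDP (none of the names come from the book): we place the agent in state $s$, force action $a$, and follow $\pi$ from then on, without ever asking how the agent might have reached $s$ in the first place.

```python
import random

# Made-up toy MDP: a 1-D walk on states 0..4; reaching state 4 gives reward +1,
# and the episode ends at either boundary (state 0 or state 4).
def step(state, action):                 # action is -1 or +1
    next_state = state + action
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state in (0, 4)
    return next_state, reward, done

def pi(state):                           # the "future" policy being evaluated
    return +1 if random.random() < 0.8 else -1

def mc_estimate_Q(s, a, gamma=0.9, episodes=5_000):
    """Monte Carlo estimate of Q_pi(s, a): start in s, take a, then follow pi.
    Nothing here asks how, or under which past policy, the agent reached s."""
    total = 0.0
    for _ in range(episodes):
        state, action, discount, G, done = s, a, 1.0, 0.0, False
        while not done:
            state, reward, done = step(state, action)
            G += discount * reward
            discount *= gamma
            action = pi(state)
        total += G
    return total / episodes

print("Q_pi(2, +1) is roughly", mc_estimate_Q(2, +1))
```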
