In the summary of notation (page xvi) of Sutton and Barto's book, they define $S_t$ as:
state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
This is similar to what you observed:
However, what confuses me is that in order to properly define $S_t$ for $t>0$ it seems necessary to first define all the $S_i, A_i$, for $0 \leq i < t$.
The main difference between the book's definition and your observation is that they only take $i = t-1$, not all $0 \leq i < t$. That single prior step is sufficient because of the Markov property, which is essentially assumed to hold throughout the entire book; we're almost always talking about Markov Decision Processes.
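Concretely, the Markov property says that the distribution of the next state depends only on the most recent state and action; the rest of the history provides no additional information:

\begin{equation}
\Pr\{S_t = s' \mid S_{t-1}, A_{t-1}, S_{t-2}, A_{t-2}, \dots, S_0, A_0\} = \Pr\{S_t = s' \mid S_{t-1}, A_{t-1}\}.
\end{equation}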
Another difference is that $S_t$ does not strictly require $S_{t-1}$ and $A_{t-1}$ to be well-defined; those values only typically happen to explain how we ended up where we are now ($S_t$). The obvious exception is the initial state $S_0$, which we simply end up in "out of the blue", so to speak.
Hence, it seems that the definition of $S_t$ only makes sense if we specify with which policy we're choosing our actions at all the time-steps before we reached time-step $t$.
This is not necessary. An agent could, in theory, even change its policy during an episode. In fact, as a random variable, $S_t$ doesn't really represent just a single value. It's a symbol that we use to denote the state that we happen to be in during some episode at time $t$, without caring about what we did before that or plan to do after it. See the Wikipedia page on random variables.
\begin{equation}
Q_\pi (s,a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]
\end{equation}
This equation simply says that the value $Q_\pi(s, a)$ is equal to the return $G_t$ (written out below) that we expect to obtain if:
- we start following policy $\pi$ from now on (it doesn't matter which policy we've been following up until now)
- we happen to currently be in state $s$ (this is a specific value, not a random variable), and we happen to have selected action $a$.
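Here $G_t$ is the usual discounted return (as defined in the book), which only involves rewards obtained from time $t$ onwards:

\begin{equation}
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.
\end{equation}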
Note how the definition does not depend directly on the policy we've been using in the past. It depends on that past policy only indirectly, in the sense that it helps explain how we may have ended up in state $S_t = s$. But we do not require knowledge of our past policy to properly define anything in this equation, and that past policy will often not even be enough on its own to fully explain how we ended up where we happen to be now. For example, in nondeterministic environments, we may also need to know the environment's random number generator and its seed to completely explain why we are where we are now. The definition does not rely on the ability to explain any of this: we simply take for granted that, at time $t$, we are in state $S_t = s$, and the equation is well-defined from there.
This equation happens to rely on the future policy $\pi$, but that may be a different policy from our past policy, and this reliance is denoted by the subscript on $Q_\pi$ and $\mathbb{E}_\pi$.
Here is the one-step Q-learning update rule as you gave it:
\begin{equation}Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]\end{equation}
That update rule actually matches exactly what the "target policy" (the greedy policy in this case) is doing: we update the $Q$-value for the state-action pair $(s_t, a_t)$ for which we've just obtained a new observation ($R_{t+1}$ and $s_{t+1}$), under the assumption that we follow the greedy / target policy immediately afterwards (resulting in the $\max_a Q(s_{t+1}, a)$ term). In this equation, the only action that may not have been selected according to the target policy is the action $a_t$, but that's fine, because that is precisely the action for which we're updating the $Q$-value.
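As a minimal sketch (not from the book; assuming a small tabular problem, a nested dict `Q[s][a]` of action-value estimates, and a list `actions` of available actions), the update could look like this:

```python
import random
from collections import defaultdict

# Q[s][a] is the current action-value estimate; unseen pairs default to 0.0.
Q = defaultdict(lambda: defaultdict(float))

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One-step Q-learning update: the bootstrap term max_a Q(s_next, a) assumes
    the greedy (target) policy takes over from s_next onwards, no matter which
    action the behaviour policy will actually pick there."""
    td_target = r + gamma * max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (td_target - Q[s][a])
```

Note that `epsilon_greedy` is only used to generate behaviour; the update itself bootstraps from the greedy maximum, which is exactly why one-step Q-learning is off-policy without any correction.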
Now suppose that we try writing a multi-step (or two-step) update rule naively, without importance sampling. That would look as follows:
\begin{equation}Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 \max_a Q(s_{t+2}, a) - Q(s_t, a_t) \right]\end{equation}
This update rule assumes that our behaviour policy (typically something like $\epsilon$-greedy) was used to select another action $a_{t+1}$, resulting in an additional reward observation $R_{t+2}$ and state $s_{t+2}$. Here we suddenly have a problem for off-policy learning: the update uses a reward $R_{t+2}$ that is the result of an action $a_{t+1}$ which our target policy may not have selected, while the action whose $Q$-value we're actually updating is a different one ($a_t$). Using information from such "incorrect" actions is fine if it's only used to update $Q$-values for those same "incorrect" actions, but it's not fine if we use it to update the $Q$-value for a different, "correct" action (one that the greedy / target policy would also have selected).
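A sketch in the same style (same assumed tabular `Q` and `actions`) makes the problem visible in code: `r2` and `s2` are consequences of the behaviour policy's action `a1`, which the greedy target policy might never have chosen in `s1`:

```python
def naive_two_step_update(Q, s0, a0, r1, s1, a1, r2, s2, actions,
                          alpha=0.1, gamma=0.99):
    """Naive two-step update WITHOUT importance sampling (incorrect off-policy).

    s1 and a1 are listed only to make the trajectory explicit: a1 is the
    behaviour-policy action that produced r2 and s2, yet it receives no
    correction below, and the credit goes to (s0, a0)."""
    td_target = r1 + gamma * r2 + gamma**2 * max(Q[s2][a2] for a2 in actions)
    Q[s0][a0] += alpha * (td_target - Q[s0][a0])
```

Roughly speaking, an importance-sampling correction would re-weight the terms that depend on $a_{t+1}$ by the ratio $\pi(a_{t+1} \mid s_{t+1}) / b(a_{t+1} \mid s_{t+1})$; that ratio is precisely what the naive rule above omits.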
Best Answer
The answer is tucked in the abstract (emphasis mine): "We prove that, given training under any $\epsilon$-soft policy [...] to the action-value function for an arbitrary target policy."
The authors also write, "We always use $\pi$ to denote the target policy, the policy that we are learning about."
Epsilon-soft is, as defined here, synonymous with epsilon-greedy:
Wiki defines epsilon-greedy similarly:
That is, there are no restrictions on the target policy $\pi$; the only requirement is that the behaviour policy $b$ is epsilon-soft.
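For reference, in the standard (Sutton and Barto) definitions, $\epsilon$-soft just means that every action keeps at least a minimal probability in every state, and $\epsilon$-greedy is the particular $\epsilon$-soft policy that puts all of the remaining probability on the greedy action:

\begin{equation}
b(a \mid s) \geq \frac{\epsilon}{|\mathcal{A}(s)|} \;\;\text{for all } s, a,
\qquad
b_{\epsilon\text{-greedy}}(a \mid s) =
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{if } a = \arg\max_{a'} Q(s, a'), \\
\dfrac{\epsilon}{|\mathcal{A}(s)|} & \text{otherwise.}
\end{cases}
\end{equation}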