Understanding a statement in Sutton’s Reinforcement Learning Section 5.5 on Importance Sampling

machine learning

I am trying to understand Section 5.5 of Sutton's book on Reinforcement Learning, in particular a statement on page 104 related to off-policy prediction via importance sampling.

Supposing $b$ is a behavior policy and $\pi$ is a target policy, and that $p$ is the state-transition probability function of an MDP, define

$$\rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$$

where, given a starting state $S_t$, the trajectory of states and actions is $S_t, A_t, S_{t+1}, A_{t+1}, \ldots, A_{T-1}, S_T$.
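For concreteness, here is a minimal Python sketch (mine, not from the book) of how this ratio can be computed from a recorded trajectory; `target_prob(a, s)` and `behavior_prob(a, s)` are hypothetical callables returning $\pi(a \mid s)$ and $b(a \mid s)$. Because the transition probabilities cancel, the dynamics $p$ are never needed:

```python
def importance_ratio(states, actions, target_prob, behavior_prob):
    """Compute rho_{t:T-1} = prod_{k=t}^{T-1} pi(A_k|S_k) / b(A_k|S_k).

    `states` holds S_t, ..., S_{T-1} and `actions` holds A_t, ..., A_{T-1}
    from one episode; the p(S_{k+1}|S_k, A_k) terms cancel and are not needed.
    """
    rho = 1.0
    for s, a in zip(states, actions):
        rho *= target_prob(a, s) / behavior_prob(a, s)
    return rho
```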

Sutton then states that $$ \mathbb{E}[\rho_{t:T-1} G_t |S_t=s] = v_{\pi}(s)$$

I'm assuming this expectation is taken with respect to trajectories generated by the behavior policy $b$.

I know the RHS should be

$$ v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, [r + \gamma v_{\pi}(s')] $$

But I am having difficulty understanding the LHS in summation notation or why it is equal to $v_{\pi} (s)$. Any insights appreciated.

Best Answer

As you say, the left-hand side of (5.4) is an expectation taken with respect to the behavior policy $b$. To get (5.4), you must look at the entire sequence of states and actions, rather than just the next time step.

Recall that the definition of $v_\pi$ is, from (3.12),
$$ v_{\pi}(s) := \mathbb{E}_\pi[ G_t \mid S_t = s].$$
To get (5.4), you should truncate $G_t$ so that it runs only up to time step $T$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_{T}.$$
Then, by definition,
$$v_\pi(s) = \sum \mathbb{E}[G_t \mid S_t, A_t, \dots, A_{T-1}, S_T]\ {\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid S_t = s, \hbox{ using policy $\pi$}],$$
where the sum is over all possible sequences of actions and states $A_t$, $S_{t+1}$, $\dots$, $A_{T-1}$, $S_T$.

To turn this into an expectation with respect to policy $b$, multiply and divide each term by the probability of that trajectory under $b$:
\begin{eqnarray*}
v_\pi(s) &=& \sum \mathbb{E}[G_t \mid S_t, A_t, \dots, A_{T-1}, S_T]\ \frac{{\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid S_t = s, \hbox{ using policy $\pi$}]}{{\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid S_t = s, \hbox{ using policy $b$}]}\\
&\ &\qquad \times {\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid S_t = s, \hbox{ using policy $b$}].
\end{eqnarray*}
If you now give the quotient of probabilities its own name,
$$ \rho_{t:T-1}(S_t, A_t, S_{t+1}, \dots, A_{T-1}, S_T) := \frac{{\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid \hbox{the given $S_t$, using policy $\pi$}]}{{\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid \hbox{the given $S_t$, using policy $b$}]}, $$
you can rewrite this as
\begin{eqnarray*}
v_\pi(s) &=& \sum \mathbb{E}[G_t \mid S_t, A_t, \dots, A_{T-1}, S_T]\ \rho_{t:T-1}(S_t, A_t, S_{t+1}, \dots, A_{T-1}, S_T)\\
&\ &\qquad \times {\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid S_t = s, \hbox{ using policy $b$}]\\
&=& \mathbb{E}_b[\rho_{t:T-1} G_t \mid S_t = s],
\end{eqnarray*}
which is equation (5.4).

Finally, as Sutton explains, the numerator and the denominator in the definition of $\rho_{t:T-1}$ each factor into a product of one-step probabilities over the time steps $t$, $t+1$, $\dots$, $T-1$, for example
$$ {\rm Pr}[A_t, S_{t+1}, \dots, A_{T-1}, S_T \mid S_t, \hbox{ using policy $\pi$}] = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k), $$
so the transition probabilities cancel in the quotient and only the policy ratios remain, which gives you equation (5.3).
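To see the identity numerically, here is a small self-contained Python simulation on a hypothetical two-step episodic MDP (my own sketch, not from the book): episodes are generated with the behavior policy $b$, each return is weighted by $\rho_{t:T-1}$, and the weighted average is compared against the exact $v_\pi(s)$. The names `pi`, `b`, `episode`, and `is_estimate`, and the MDP itself, are all illustrative.

```python
import random

# A hypothetical 2-step episodic MDP (gamma = 1): state 0 -> state 1 -> terminal,
# with the reward depending only on the (state, action) pair.
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def pi(a, s):   # target policy: strongly prefers action 1 in every state
    return 0.9 if a == 1 else 0.1

def b(a, s):    # behavior policy: uniform over the two actions
    return 0.5

def episode(policy):
    """Generate one episode under `policy`; return a list of (state, action, reward)."""
    traj = []
    for s in (0, 1):
        a = 1 if random.random() < policy(1, s) else 0
        traj.append((s, a, REWARD[(s, a)]))
    return traj

def is_estimate(num_episodes=200_000):
    """Ordinary importance-sampling estimate of v_pi(0) from episodes generated by b."""
    total = 0.0
    for _ in range(num_episodes):
        traj = episode(b)
        g = sum(r for _, _, r in traj)        # undiscounted return G_t
        rho = 1.0
        for s, a, _ in traj:                  # rho_{t:T-1}: product of policy ratios
            rho *= pi(a, s) / b(a, s)
        total += rho * g
    return total / num_episodes

print("importance-sampling estimate:", is_estimate())
print("exact v_pi(0):               ", 0.9 * 1.0 + (0.1 * 1.0 + 0.9 * 3.0))   # = 3.7
```

With enough episodes the ordinary importance-sampling average converges to the exact value 3.7, as (5.4) predicts, even though no episode was generated by $\pi$.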
