Reinforcement Learning – Understanding the Policy Improvement Theorem

reinforcement learning

In reinforcement learning, policy improvement is part of an algorithm called policy iteration, which attempts to find approximate solutions to the Bellman optimality equations. Pages 84-85 of Sutton and Barto's book on RL state the following theorem:

Policy Improvement Theorem

Given two deterministic policies $\pi$ and $\pi'$:

$\forall s \in S, V_{\pi}(s) \leq Q_{\pi}(s, \pi'(s))$

RHS of the inequality: the agent acts according to policy $\pi'$ in the current state, and for all subsequent states acts according to policy $\pi$.

LHS of the inequality: the agent acts according to policy $\pi$ starting from the current state.

Claim: $\forall s \in S, V_\pi(s) \leq V_{\pi'}(s)$

In other words, $\pi'$ is an improvement over $\pi$.
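
To make the RHS of the premise concrete: for a finite MDP with dynamics $p(s', r \mid s, a)$ (the four-argument dynamics function used throughout the book) and a deterministic $\pi'$, it expands as

$$Q_\pi(s, \pi'(s)) = \sum_{s', r} p(s', r \mid s, \pi'(s)) \left[ r + \gamma V_\pi(s') \right],$$

so $\pi'$ only chooses the very first action, and everything after the first transition is valued under $\pi$ through $V_\pi(s')$.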

I have difficulty understanding the proof, which is reproduced below:

Proof:
$$ V_\pi(s) \leq Q_\pi(s, \pi'(s)) = \mathbb{E}_{\pi'}[R_{t+1} + \gamma V_\pi(S_{t+1}) | S_t = s]$$

I am stuck here. The q-function is evaluated under the policy $\pi$; that being the case, why is the expectation taken with respect to the policy $\pi'$?

My guess is the following: in the proof given in Sutton and Barto, the expectation is unrolled in time. At each unrolling step, the agent follows the policy $\pi'$ for one additional time step and then follows $\pi$ from then on. In the limit of this process, the governing policy transforms from $\pi$ to $\pi'$. As long as the return inside the expectation contains only finitely many terms, the governing policy should still be $\pi$; only in the limit of this process does it become $\pi'$.
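
For what it is worth, a small numerical check seems consistent with the theorem. The sketch below uses a made-up two-state, two-action MDP (all dynamics, rewards, and names are arbitrary and purely illustrative): $V_\pi$ is obtained by solving the Bellman equations exactly, $\pi'$ is taken greedy with respect to $Q_\pi$, and both the premise and the claim then hold state by state.

    # Sketch: check the policy improvement theorem on a made-up two-state MDP.
    import numpy as np

    gamma = 0.9
    n_states, n_actions = 2, 2

    # P[a, s, s'] = probability of moving from s to s' when taking action a in s
    P = np.array([[[0.8, 0.2],
                   [0.3, 0.7]],
                  [[0.1, 0.9],
                   [0.6, 0.4]]])
    # R[a, s] = expected immediate reward for taking action a in state s
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])

    def evaluate(policy):
        """Solve V = R_pi + gamma * P_pi V exactly for a deterministic policy."""
        P_pi = np.array([P[policy[s], s] for s in range(n_states)])
        R_pi = np.array([R[policy[s], s] for s in range(n_states)])
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    pi = np.array([0, 0])           # an arbitrary initial deterministic policy
    V_pi = evaluate(pi)
    Q_pi = R + gamma * P @ V_pi     # Q_pi[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V_pi[s']
    pi_prime = Q_pi.argmax(axis=0)  # greedy improvement: pi'(s) = argmax_a Q_pi(s, a)

    eps = 1e-12                     # tolerance for floating-point noise
    # Premise of the theorem: V_pi(s) <= Q_pi(s, pi'(s)) for every state s
    assert np.all(V_pi <= Q_pi[pi_prime, np.arange(n_states)] + eps)
    # Claim: V_pi(s) <= V_pi'(s) for every state s
    assert np.all(V_pi <= evaluate(pi_prime) + eps)

Of course this only checks one particular set of numbers; it does not by itself explain the step in the proof that confuses me.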

Best Answer

They never quite spell it out, but an expression like: \begin{align} E_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s] \end{align} means "the expected discounted value when starting in state $s$, choosing actions according to $\pi'$ for the next time step, and according to $\pi$ thereafter", while: \begin{align} E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) | S_t=s] \end{align} means "the expected discounted value when starting in state $s$, choosing actions according to $\pi'$ for the next TWO timesteps, and according to $\pi$ thereafter", etc.
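
Written without the policy subscript, the second expression is just a conditional expectation in which $\pi'$ selects only the first two actions:

\begin{align} E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) | S_t=s] = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) | S_t=s, A_t=\pi'(s), A_{t+1}=\pi'(S_{t+1})] \end{align}

so the subscript $\pi'$ only governs the actions that appear explicitly before the $v_\pi$ term; everything later is already folded into $v_\pi$.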

So we really have: \begin{align} E_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s] = E[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s, A_t=\pi'(s)] \end{align} and if we look back at the beginning of section 4.2 on Policy Improvement, we can see that this is equal to $q_\pi(s, \pi'(s))$. The reason they have these two different expressions for $q_\pi(s, \pi'(s))$ is that the first is needed to complete the proof, because they need to be able to talk about following $\pi'$ for increasingly longer spans of time, while the second is simply the definition of $q_\pi(s, a)$ evaluated at the deterministic action $a = \pi'(s)$.
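
Paraphrasing the proof in the book: the premise $v_\pi(s) \le q_\pi(s, \pi'(s))$ is applied repeatedly inside the expectation, handing one more time step over to $\pi'$ at every application, until in the limit the whole return is generated by $\pi'$:

\begin{align} v_\pi(s) &\le q_\pi(s, \pi'(s)) \\ &= E_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s] \\ &\le E_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) | S_t=s] \\ &= E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) | S_t=s] \\ &\le E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) | S_t=s] \\ &\;\;\vdots \\ &\le E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots | S_t=s] \\ &= v_{\pi'}(s), \end{align}

which is exactly the claimed $v_\pi(s) \le v_{\pi'}(s)$.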
