Solved – Proof of Bellman Optimality Equation

machine learning, optimization, reinforcement learning

Following Sutton and Barto's "Reinforcement Learning: An Introduction", I am having trouble rigorously proving the Bellman Optimality Equation for finite MDPs.

Namely, why does $v_*(s) = \max\limits_{a \in A(s)} q_{\pi_*}(s, a)$?

My attempt to see why this is true:

Let $m(s) := \max\limits_{a \in A(s)} q_{\pi_*}(s, a)$.

$v_*(s) = v_{\pi_*}(s) = \sum\limits_{a \in A(s)} \pi_*(a \mid s)\, q_{\pi_*}(s, a) \leq m(s)$

However, I do not see why equality must hold. My idea is to construct $\pi'$ such that $\pi'(s) \in \arg\max \limits_{a \in A(s)} q_{\pi_*}(s, a)$ and $\pi'(s') = \pi_*(s')$ for all $s' \neq s$, and then show that $v_{\pi'}(s) = m(s)$.

Intuitively, I can see the statement being true if we allow non-stationary policies; since the dynamics and rewards are stationary, we should be able to "just take" our policy to be stationary as well, but I don't see the precise reasoning behind this.

In a similar vein, why does the policy defined by $\pi_*(s) = \arg\max \limits_{a \in A(s)} q_{\pi_*}(s, a)$ constitute an optimal policy?

Best Answer

Before answering your question, we need to establish one relationship. First recall the definitions of the state-value function and the action-value function under a policy $\pi$: \begin{align} &v_{\pi}(s)=E_{\pi}{\left[G_t \mid S_t=s\right]} \\ &q_{\pi}(s,a)=E_{\pi}{\left[G_t \mid S_t=s, A_t=a\right]} \end{align} where $G_t=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ is the return at time $t$. The relationship between these two value functions is obtained by marginalizing over the action taken at time $t$: \begin{align} v_{\pi}(s)&=E_{\pi}{\left[G_t \mid S_t=s\right]} \nonumber \\ &=\sum_{g_t} p(g_t \mid S_t=s)\,g_t \nonumber \\ &= \sum_{g_t}\sum_{a}p(g_t, a \mid S_t=s)\,g_t \nonumber \\ &= \sum_{a}\pi(a \mid s)\sum_{g_t}p(g_t \mid S_t=s, A_t=a)\,g_t \nonumber \\ &= \sum_{a}\pi(a \mid s)\,E_{\pi}{\left[G_t \mid S_t=s, A_t=a\right]} \nonumber \\ &= \sum_{a}\pi(a \mid s)\,q_{\pi}(s,a) \qquad (1) \end{align} Equation (1) is important: it relates the two fundamental value functions of reinforcement learning, and it holds for any policy. In particular, for a deterministic policy it reduces to $v_{\pi}(s)=q_{\pi}(s,\pi(s))$.
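To make Eq. (1) concrete, here is a minimal numerical sketch on a made-up two-state, two-action MDP (the transition probabilities, rewards, discount factor, and policy below are arbitrary illustrative assumptions, not taken from the book): it solves the linear Bellman expectation equation for $v_\pi$, builds $q_\pi$ from it, and checks that $v_\pi(s)=\sum_a \pi(a\mid s)\,q_\pi(s,a)$ holds in every state.

```python
# Minimal numerical check of Eq. (1): v_pi(s) = sum_a pi(a|s) q_pi(s, a).
# The 2-state, 2-action MDP and the policy below are made up purely for illustration.
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])

# An arbitrary stochastic policy pi(a|s).
pi = np.array([[0.3, 0.7],
               [0.6, 0.4]])

# Policy-averaged transition matrix and expected reward vector.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = np.einsum('sa,sa->s', pi, R)

# Solve the linear Bellman expectation equation v = r_pi + gamma * P_pi v.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# q_pi(s, a) = R[s, a] + gamma * sum_s' P[s, a, s'] v_pi(s').
q_pi = R + gamma * P @ v_pi

# Eq. (1): v_pi(s) equals the pi-weighted average of q_pi(s, .).
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
print("v_pi:", v_pi)
print("q_pi:", q_pi)
```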

Now let's start answering your question by recalling the definitions of optimal policy, optimal state-value function, and optimal action-value function:

  1. Optimal policy: If $v_{\pi}(s)\ge v_{\pi'}(s)$ for all $s\in \mathcal{S}$, then we say $\pi$ is better than or equal to $\pi'$. There is always at least one policy that is better than or equal to all other policies. This is an optimal policy.

  2. Optimal state-value function: $\displaystyle v_*(s)=\max_{\pi}v_{\pi}(s)$, $\forall s\in\mathcal{S}$ (a brute-force check of this maximization is sketched just after this list).

  3. Optimal action-value function: $\displaystyle q_*(s, a)=\max_{\pi}q_{\pi}(s, a)$, $\forall s\in\mathcal{S}$ and $a\in A(s)$.
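Here is that brute-force sketch of definition 2, restricted to deterministic policies, again on a toy two-state, two-action MDP whose numbers are illustrative assumptions: it enumerates every deterministic policy, solves the linear Bellman equation for each one to get $v_\pi$ exactly, and takes the state-wise maximum.

```python
# Brute-force the definition v_*(s) = max_pi v_pi(s) over all deterministic
# policies of a made-up 2-state, 2-action MDP (illustrative numbers only).
import itertools
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [2.0, -1.0]])

def v_of_deterministic(policy):
    """Exact v_pi for a deterministic policy given as a tuple (one action per state)."""
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    r_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Enumerate every deterministic policy and take the state-wise maximum.
values = [v_of_deterministic(p)
          for p in itertools.product(range(n_actions), repeat=n_states)]
v_star = np.max(values, axis=0)
print("v* per state:", v_star)
```

For a finite MDP this state-wise maximum is attained simultaneously by a single policy, which is why definition 2 is consistent with definition 1.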

Now assume we already know $q_*(s, a)$, and consider the deterministic policy that acts greedily with respect to it: \begin{align} \pi(a \mid s)=\begin{cases}\displaystyle 1, & a=\text{argmax}_{a'\in A(s)}q_*(s, a')\\ 0, & \text{otherwise} \end{cases} \end{align} This greedy policy is optimal. Indeed, for any optimal policy $\pi_*$ we have $q_*=q_{\pi_*}$, so $\pi$ is exactly the greedy improvement of $\pi_*$; the policy improvement theorem then gives $v_{\pi}(s)\ge v_{\pi_*}(s)=v_*(s)$ for all $s$, while $v_{\pi}(s)\le v_*(s)$ holds by the definition of $v_*$, hence $v_{\pi}=v_*$. This also answers your second question of why $\pi_*(s)=\text{argmax}_{a\in A(s)}q_{\pi_*}(s,a)$ defines an optimal policy. Because at least one deterministic policy is optimal in a finite MDP, it suffices to consider deterministic policies. Substituting this deterministic optimal policy into Eq. (1) gives the important relationship \begin{align} v_{*}(s)=\max_{a\in A(s)}q_*(s, a) \end{align} This is the famous Bellman optimality equation.
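To tie the pieces together, here is one more sketch on the same toy MDP (the numbers remain illustrative assumptions): value iteration computes $q_*$, the first assertion checks the Bellman optimality equation $v_*(s)=\max_{a}q_*(s,a)$, and the greedy deterministic policy extracted from $q_*$ is verified to achieve $v_*$, which is exactly the claim in your second question.

```python
# Value iteration on a made-up 2-state, 2-action MDP, then two checks:
#   (i)  v_*(s) = max_a q_*(s, a)        (Bellman optimality equation)
#   (ii) the greedy policy w.r.t. q_* attains v_*.
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [2.0, -1.0]])

# Value iteration: v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ].
v = np.zeros(n_states)
for _ in range(10_000):
    q = R + gamma * P @ v
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

q_star = R + gamma * P @ v
assert np.allclose(v, q_star.max(axis=1))        # (i) Bellman optimality equation

# Greedy deterministic policy with respect to q_*.
greedy = q_star.argmax(axis=1)

# Its value solves the linear system v = r_pi + gamma * P_pi v and matches v_*.
P_pi = P[np.arange(n_states), greedy]
r_pi = R[np.arange(n_states), greedy]
v_greedy = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
assert np.allclose(v_greedy, v)                  # (ii) greedy policy is optimal
print("v*:", v, "greedy policy:", greedy)
```

The value found here also matches the brute-force maximum over deterministic policies from the earlier sketch, which illustrates the statement that at least one deterministic policy is optimal.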