Solved – epsilon-greedy policy improvement

q-learning, reinforcement-learning

I am learning reinforcement learning from David Silver's open course and Richard Sutton's book. While I am enjoying both the course and the book very much, I am currently confused about $\epsilon$-greedy policy improvement.

Both the book and the open course have a theorem saying that

For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, i.e., $v_{\pi'}(s)\ge v_{\pi}(s)$

which is proved by
[proof of the $\epsilon$-greedy policy improvement theorem]
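For reference, the four lines discussed in the answer below are, up to notation, the standard chain from the book:

$$
\begin{aligned}
q_\pi(s,\pi'(s)) &= \sum_{a\in A}\pi'(a|s)\,q_\pi(s,a) \\
&= \frac{\epsilon}{m}\sum_{a\in A} q_\pi(s,a) + (1-\epsilon)\max_{a\in A} q_\pi(s,a) \\
&\ge \frac{\epsilon}{m}\sum_{a\in A} q_\pi(s,a) + (1-\epsilon)\sum_{a\in A}\frac{\pi(a|s)-\epsilon/m}{1-\epsilon}\,q_\pi(s,a) \\
&= \sum_{a\in A}\pi(a|s)\,q_\pi(s,a) = v_\pi(s),
\end{aligned}
$$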

where the inequality holds because the $\max$ operation is greater than or equal to an arbitrary weighted sum. ($m$ is the number of actions.)

However, the theorem does not make sense to me: if $\epsilon\approx 1$, it implies that an (almost) random policy would be better than the current one.

When I walked through the proof, I found that the weights should be nonnegative, i.e., $\pi(a|s)-\epsilon/m\ge 0$ for all $a$, which bounds $\epsilon$ by $\epsilon \le m\min_a \pi(a|s)$.
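To make the bound concrete with a small made-up example: take $m=4$ actions and $\epsilon=0.2$; a policy that is itself $\epsilon$-greedy then has

$$\pi(\cdot|s) = (0.85,\ 0.05,\ 0.05,\ 0.05), \qquad m\min_a\pi(a|s) = 4\times 0.05 = 0.2 = \epsilon,$$

so in that case the bound is satisfied, with equality.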

Further, for deterministic (e.g., greedy) policies, $\pi(a|s)=0$ for $a\ne\arg\max_a q_\pi(s,a)$, so the bound forces $\epsilon = 0$ and the theorem says little.

Could anyone verify my understanding or shed more light on the theorem?

Best Answer

Ok! If you remember generalized policy iteration, the idea is that a policy $\pi$ is first evaluated, giving $v_\pi(s)$, and then improved into $\pi'$ by acting greedily with respect to that evaluation: $\pi' \rightarrow \text{greedy}(v_\pi(s))$. The theorem basically states that acting $\varepsilon$-greedily *still* results in policy improvement, i.e., a policy with at least as much value: $v_{\pi'}(s) \geq v_\pi(s)$. The proof works this out in reverse.
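To make that evaluate-then-improve loop concrete, here is a minimal tabular sketch. The tiny two-state MDP, the transition table `P`, and every name in it are assumptions made up purely for illustration, not anything from the course or the book:

```python
import numpy as np

# A minimal sketch of generalized policy iteration with epsilon-greedy
# improvement on a tiny, hand-made two-state MDP.

n_states, n_actions, gamma, eps = 2, 2, 0.9, 0.1

# P[s][a] = list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def evaluate(pi, n_sweeps=500):
    """Iterative policy evaluation: returns q_pi as an (S, A) array."""
    q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        for s in range(n_states):
            for a in range(n_actions):
                q[s, a] = sum(p * (r + gamma * pi[s2] @ q[s2])
                              for p, s2, r in P[s][a])
    return q

def eps_greedy_improve(q):
    """Epsilon-greedy improvement: every action gets eps/m, the greedy one gets the rest."""
    pi = np.full((n_states, n_actions), eps / n_actions)
    pi[np.arange(n_states), q.argmax(axis=1)] += 1.0 - eps
    return pi

pi = np.full((n_states, n_actions), 1.0 / n_actions)  # start from a uniform (epsilon-soft) policy
for i in range(4):
    q = evaluate(pi)                                  # q_pi
    print(f"iteration {i}: v_pi =", (pi * q).sum(axis=1))
    pi = eps_greedy_improve(q)                        # the policy the theorem calls pi'
```

The printed $v_\pi(s)$ values should never decrease from one iteration to the next, which is what the theorem promises.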

Line 1, left side: we evaluate our improved policy $\pi'$ using the action-value function (instead of the state-value function) of our original policy: $q_\pi(s,\pi'(s)).$

Line 1, right side: $\sum_{a \in A}\pi'(a|s)q_\pi(s,a) \approx v_{\pi'}(s)$ (more on this approximation at the end).

Line 2: we expand $\pi'(a|s)$, the probability of taking action $a$ in a given state, according to the definition of being $\varepsilon$-greedy: with probability $1-\varepsilon$ we take the max action, and with probability $\varepsilon$ we take a uniformly random action (which could include the max action), so each action receives $\varepsilon/m$ from the random part, where $m$ is the total number of actions.
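Written out, that expansion is the standard $\varepsilon$-greedy definition:

$$
\pi'(a|s)=
\begin{cases}
1-\varepsilon+\dfrac{\varepsilon}{m} & \text{if } a=\arg\max_{a'\in A} q_\pi(s,a'),\\[4pt]
\dfrac{\varepsilon}{m} & \text{otherwise.}
\end{cases}
$$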

Line 3: this is by far the most fun part of the proof. $\max_{a \in A}q_\pi(s,a)$ is replaced by $\sum_{a \in A}\frac{\pi(a|s)-\varepsilon/m}{1-\varepsilon}q_\pi(s,a).$ The claim is that the value of the max action is greater than or equal to this weighted sum of action values under $\pi$, for any arbitrarily chosen max action. To see this, imagine there are two actions, $A = \{a_1,a_2\}$, where $a_2$ is arbitrarily selected as the max under $\pi$ (so $\pi(a_1|s)=\varepsilon/2$ and $\pi(a_2|s)=1-\varepsilon+\varepsilon/2$):

$$\sum_{a \in A}\frac{\pi(a|s)-\varepsilon/2}{1-\varepsilon}q_\pi(s,a) = \frac{(\varepsilon/2 -\varepsilon/2)\,q_\pi(s,a_1) + \bigl((1-\varepsilon) + \varepsilon/2 -\varepsilon/2\bigr)\,q_\pi(s,a_2)}{1-\varepsilon} = \frac{(1-\varepsilon)\,q_\pi(s,a_2)}{1-\varepsilon} = q_\pi(s,a_2).$$

The equality is satisfied when the max action under $\pi'$ is the same as under $\pi$, namely $a_2$: $\max_{a \in A}q_\pi(s,a) = q_\pi(s,a_2)$. The inequality is strict when the max action under $\pi'$ is not $a_2$, in which case $\max_{a \in A}q_\pi(s,a) > q_\pi(s,a_2)$.
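If you want a quick numerical sanity check of this step, here is a tiny sketch with arbitrary numbers; the only requirement is that $\pi$ is $\varepsilon$-soft (every action has probability at least $\varepsilon/m$), which is exactly the nonnegativity condition raised in the question:

```python
import numpy as np

m, eps = 4, 0.3

# An arbitrary epsilon-soft policy (every action has probability >= eps/m)
# and arbitrary action values -- both chosen purely for illustration.
pi = np.array([0.10, 0.20, 0.30, 0.40])
q = np.array([1.5, -0.3, 0.8, 0.2])

weights = (pi - eps / m) / (1.0 - eps)
assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)

# A weighted average with nonnegative weights summing to one can never
# exceed the maximum -- the inequality used in line 3 of the proof.
assert q.max() >= weights @ q
print(q.max(), weights @ q)
```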

Line 4: cancel terms and simplify.
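Spelled out, using line 3's right-hand side:

$$
\begin{aligned}
\frac{\varepsilon}{m}\sum_{a\in A} q_\pi(s,a) + (1-\varepsilon)\sum_{a\in A}\frac{\pi(a|s)-\varepsilon/m}{1-\varepsilon}\,q_\pi(s,a)
&= \sum_{a\in A}\Bigl(\frac{\varepsilon}{m}+\pi(a|s)-\frac{\varepsilon}{m}\Bigr) q_\pi(s,a)\\
&= \sum_{a\in A}\pi(a|s)\,q_\pi(s,a) = v_\pi(s).
\end{aligned}
$$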

What is crucial to understand is that generalized policy iteration involves multiple iterations of evaluation and improvement. The policy $\pi$ can be different from $\pi'$ (hence the difference in whether $a_2$ is selected as the max) because each was improved under a different evaluation. This theorem is essentially saying, "Hey, greedily selecting actions after looking at my last choice of greedily selecting actions will either improve my overall reward or just keep it the same." One subtlety (for understanding the $\approx$ in line 1) is that, exactly (not approximately), $\sum_{a \in A}\pi'(a|s)q_{\pi'}(s,a) = v_{\pi'}(s).$