Solved – Monte Carlo $\epsilon$-greedy policy is better than $\epsilon$-soft policy

monte carlo, reinforcement learning

In the RL book by Sutton and Barto, the authors show that any $\epsilon$-greedy policy with respect to $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$, which is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. In this derivation, I couldn't understand how the authors went from Equation 1 to Equation 2.

Equation 1:
$q_{\pi}(s,\pi'(s)) = \sum_{a}\pi'(a|s)\,q_{\pi}(s,a)$

Equation 2:
$q_{\pi}(s,\pi'(s)) = \frac{\epsilon}{|A(s)|}\sum_{a} q_{\pi}(s,a) + (1 - \epsilon)\max_{a} q_{\pi}(s,a)$

As far as I understand, we choose non-greedy actions with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$. But then how did we end up with $\frac{\epsilon}{|A(s)|}$ as the weight for the non-greedy actions? Shouldn't it be $\frac{\epsilon}{\text{number of non-greedy actions}}$, so that the weights sum to 1, since they are probabilities after all?

Am I missing something here? Please help me out; I am a beginner in RL. Thanks.

Best Answer

By a "non-greedy" action, they mean an action picked uniformly at random from the set of actions available in state $s$, $A(s)$; each such pick has probability $\frac{1}{|A(s)|}$.

The key point is that this uniform pick can also land on the greedy action. So the $\epsilon$-greedy policy assigns probability $\frac{\epsilon}{|A(s)|}$ to every action, and the greedy action receives an extra $1 - \epsilon$ on top of that:

$\pi'(a|s) = \frac{\epsilon}{|A(s)|} + (1-\epsilon)\,\mathbb{1}\!\left[a = \arg\max_{a'} q_{\pi}(s,a')\right]$

These probabilities do sum to 1: $|A(s)| \cdot \frac{\epsilon}{|A(s)|} + (1-\epsilon) = \epsilon + 1 - \epsilon = 1$. Substituting this $\pi'(a|s)$ into Equation 1 splits the sum into exactly the two terms of Equation 2.
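To make this concrete, here is a minimal Python sketch (the numbers in `q_values` and the value of `epsilon` are made up for illustration) that builds the $\epsilon$-greedy distribution this way and checks that Equation 1 and Equation 2 give the same number:

```python
import numpy as np

# Hypothetical action values q_pi(s, a) for one state s with 4 actions.
q_values = np.array([1.0, 3.0, 2.0, -0.5])
epsilon = 0.1
n_actions = len(q_values)  # |A(s)|

# epsilon-greedy distribution: every action gets epsilon / |A(s)|,
# and the greedy action gets an extra (1 - epsilon) on top of that.
pi = np.full(n_actions, epsilon / n_actions)
pi[np.argmax(q_values)] += 1.0 - epsilon
assert np.isclose(pi.sum(), 1.0)  # the probabilities sum to 1

# Equation 1: expectation of q_pi under the epsilon-greedy policy.
lhs = np.dot(pi, q_values)

# Equation 2: the rewritten form from the book.
rhs = (epsilon / n_actions) * q_values.sum() + (1.0 - epsilon) * q_values.max()

print(lhs, rhs)  # both print 2.8375
assert np.isclose(lhs, rhs)
```

The assertion passes precisely because the greedy action's probability is $\frac{\epsilon}{|A(s)|} + (1-\epsilon)$, not just $1-\epsilon$.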
