Solved – Monte Carlo $\epsilon$-greedy policy is better than $\epsilon$-soft policy

monte carlo, reinforcement learning

In the RL book by Sutton and Barto, the authors show that any $\epsilon$-greedy policy with respect to $q_{\pi}$ is an improvement over any $\epsilon$-soft policy $\pi$, which is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. In this derivation, I couldn't understand how the authors went from Equation 1 to Equation 2.

Equation 1:
$q_{\pi}(s,\pi'(s)) = \sum_{a}\pi'(a|s)\,q_{\pi}(s,a)$

Equation 2:
$q_{\pi}(s,\pi'(s)) = \frac{\epsilon}{|A(s)|}\sum_{a} q_{\pi}(s,a) + (1 - \epsilon)\max_{a} q_{\pi}(s,a)$

As far as I understand, we choose non-greedy actions with probability $\epsilon$ and the greedy action with probability $1 - \epsilon$. But then how did we end up with $\frac{\epsilon}{|A(s)|}$ as the weight for the non-greedy actions? Shouldn't it be $\frac{\epsilon}{\text{number of non-greedy actions}}$, so that the weights sum to 1, since they are probabilities after all?

Am I missing something here? Please help me out; I am a beginner in RL. Thanks.

Best Answer

By a "non-greedy" action, they mean an action picked uniformly at random from the set of actions available in state $s$, $A(s)$; each such pick has probability $\frac{1}{|A(s)|}$.

The key point is that this uniform pick can also land on the greedy action. So the $\epsilon$-greedy policy assigns probability $\frac{\epsilon}{|A(s)|}$ to every action, and the greedy action receives an extra $1 - \epsilon$ on top of that:

$\pi'(a|s) = \frac{\epsilon}{|A(s)|} + (1-\epsilon)\,\mathbb{1}\!\left[a = \arg\max_{a'} q_{\pi}(s,a')\right]$

These probabilities do sum to 1: $|A(s)| \cdot \frac{\epsilon}{|A(s)|} + (1-\epsilon) = \epsilon + 1 - \epsilon = 1$. Substituting this $\pi'(a|s)$ into Equation 1 splits the sum into exactly the two terms of Equation 2.
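To make this concrete, here is a minimal Python sketch (the numbers in `q_values` and the value of `epsilon` are made up for illustration) that builds the $\epsilon$-greedy distribution this way and checks that Equation 1 and Equation 2 give the same number:

```python
import numpy as np

# Hypothetical action values q_pi(s, a) for one state s with 4 actions.
q_values = np.array([1.0, 3.0, 2.0, -0.5])
epsilon = 0.1
n_actions = len(q_values)  # |A(s)|

# epsilon-greedy distribution: every action gets epsilon / |A(s)|,
# and the greedy action gets an extra (1 - epsilon) on top of that.
pi = np.full(n_actions, epsilon / n_actions)
pi[np.argmax(q_values)] += 1.0 - epsilon
assert np.isclose(pi.sum(), 1.0)  # the probabilities sum to 1

# Equation 1: expectation of q_pi under the epsilon-greedy policy.
lhs = np.dot(pi, q_values)

# Equation 2: the rewritten form from the book.
rhs = (epsilon / n_actions) * q_values.sum() + (1.0 - epsilon) * q_values.max()

print(lhs, rhs)  # both print 2.8375
assert np.isclose(lhs, rhs)
```

The assertion passes precisely because the greedy action's probability is $\frac{\epsilon}{|A(s)|} + (1-\epsilon)$, not just $1-\epsilon$.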
