Solved – How to combine stochastic policy with Q-value Iteration

reinforcement learning

I am trying to use a stochastic policy in my Q-value iteration algorithm. Note that I am not asking about Q-learning, but rather Value Iteration that uses Q-values.

As I understand it, a stochastic policy gives the probability of choosing each action from a particular state. On the other hand, a Q-value is the value of a state-action pair. How do I combine the two?

Best Answer

Value iteration is an approach used to find an optimal policy. If you want to constrain the solution to stochastic policies, then this may only be possible if you are willing to consider a restricted set of stochastic policies. Importantly, value iteration should work with $\epsilon$-greedy policies, because a variation of the Policy Improvement Theorem still holds. This means that repeatedly choosing the maximising action with $\pi(a|s) = 1 -\epsilon$ and evaluating this should still head towards an optimal solution.

Value iteration using action values will typically sweep through all state-action pairs $(s,a)$, updating the Q-values like this:

$$q(s,a) = \sum_{s',r} p(s',r|s,a)\left(r + \gamma \max_{a'}[q(s',a')]\right)$$

where $p(s',r|s,a)$ is the probability of observing state $s'$ and reward $r$ given $s,a$.
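As a concrete illustration, here is a minimal Python sketch of that sweep. It assumes a small tabular MDP stored as `P[s][a]`, a list of `(prob, next_state, reward)` triples; that data structure and the function name `q_value_iteration` are my own illustrative choices, not something from the question.

```python
import numpy as np

def q_value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    """Sketch of Q-value iteration for a tabular MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples.
    """
    q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                # Backup: expected reward plus discounted max over next actions
                new_q = sum(p * (r + gamma * q[s_next].max())
                            for p, s_next, r in P[s][a])
                delta = max(delta, abs(new_q - q[s, a]))
                q[s, a] = new_q
        if delta < tol:
            break
    return q
```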

To adjust the formula above for an $\epsilon$-greedy approach, you change the bootstrap term so that it backs up all possible outcomes under your stochastic policy, weighted by the policy's probabilities $\pi(a|s)$. For every non-maximising action this probability is $\frac{\epsilon}{|\mathcal{A}(s)|}$, and for the current maximising action it is $1-\epsilon$ plus the same $\frac{\epsilon}{|\mathcal{A}(s)|}$ (the random choice can also land on the maximising action). Here $\mathcal{A}(s)$ denotes the set of actions available in state $s$, following the notation used in Sutton & Barto's book Reinforcement Learning: An Introduction:

$$q(s,a) = \sum_{s',r} p(s',r|s,a)\left(r + \gamma\left( (1-\epsilon)\max_{a'}[q(s',a')] + \frac{\epsilon}{|\mathcal{A}(s')|}\sum_{a'}q(s',a') \right)\right)$$

Use the above formula and perform the same repeated sweep through state-action pairs as before. The resulting policy may choose different "preferred" actions than the deterministic one, because it will tend to avoid states where an occasional random action would be costly.
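The corresponding change to the sketch above is confined to the bootstrap term: it mixes the greedy maximum with a uniform average over all next actions. The same assumptions about `P` and the function name apply.

```python
import numpy as np

def eps_greedy_q_value_iteration(P, n_states, n_actions, gamma=0.9,
                                 epsilon=0.1, tol=1e-8):
    """Sketch of Q-value iteration under an epsilon-greedy policy constraint."""
    q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                new_q = 0.0
                for p, s_next, r in P[s][a]:
                    # Bootstrap value under the epsilon-greedy policy:
                    # (1 - epsilon) weight on the greedy action, plus epsilon
                    # spread uniformly over all actions (greedy one included).
                    v_next = ((1.0 - epsilon) * q[s_next].max()
                              + (epsilon / n_actions) * q[s_next].sum())
                    new_q += p * (r + gamma * v_next)
                delta = max(delta, abs(new_q - q[s, a]))
                q[s, a] = new_q
        if delta < tol:
            break
    return q
```

With `epsilon=0.0` this reduces to the standard backup shown earlier.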

You can construct similar update rules for other stochastic policies, but bear in mind that they may not be backed by theory guaranteeing convergence to an optimal variant of your policy. Also, if you end up without any maximisation over the $q(s',a')$ term in the algorithm, then you are likely performing policy evaluation of some arbitrary policy rather than policy improvement.
