Reinforcement Learning – How to Handle a Changing Action Space in Reinforcement Learning

reinforcement learning

I'm training a reinforcement learning model to play a game via self-play (a second instance of the model is its opponent). The agent has a set of possible actions to choose from in each state, and those actions usually remain the same. Q-learning then tries to map the best actions to the highest rewards, and DQN tries to estimate Q-values for unseen states.

I now have a case where, at certain times, some actions cannot be taken. In fact, the set of possible actions keeps shrinking, until only one action remains before the game ends. How do I handle that? Do I simply give a huge negative reward when an illegal action is chosen and let the model choose again? That way the model would have to learn that those actions cannot be taken in certain situations.

Or is there a different approach that would avoid having to learn this at all?

Best Answer

You don't need to do anything special to handle this. The only thing you need to change is to not take any illegal actions.

The typical Q-learning greedy policy is $\pi(s) = \operatorname{argmax}_{a \in \mathcal{A}} \hat q(s,a)$, and the epsilon-greedy behaviour policy is very similar. Simply replace the action space $\mathcal{A}$ with the set of legal actions in the current state, $\mathcal{A}_\text{legal}(s)$.
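
As a minimal sketch of what this can look like in practice, assuming a DQN implemented in PyTorch, a hypothetical `q_network` that maps a state tensor to a vector of Q-values (one per action), and a boolean `legal_mask` provided by the environment:

```python
import random
import torch

def select_action(q_network, state, legal_mask, epsilon):
    """Epsilon-greedy action selection restricted to legal actions.

    q_network  -- maps a state tensor to a vector of Q-values (one per action)
    state      -- tensor describing the current state
    legal_mask -- boolean tensor, True where the action is legal in `state`
    epsilon    -- exploration probability
    """
    legal_actions = torch.nonzero(legal_mask, as_tuple=False).flatten().tolist()

    # Exploration: sample uniformly among the *legal* actions only.
    if random.random() < epsilon:
        return random.choice(legal_actions)

    # Exploitation: mask out illegal actions before the argmax,
    # so they can never be selected regardless of their Q-values.
    with torch.no_grad():
        q_values = q_network(state)
    q_values = q_values.masked_fill(~legal_mask, float("-inf"))
    return int(torch.argmax(q_values).item())
```

The same masking should also be applied when computing the bootstrap target, i.e. use $\max_{a' \in \mathcal{A}_\text{legal}(s')} \hat q(s', a')$, so the network is never asked to back up the value of an action that could not actually be taken in the next state.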
