Solved – Why does Deep Q-Learning have “do nothing” actions

deep learning, reinforcement learning

In DeepMind's paper on Deep Q-Learning for Atari video games (here), they have a parameter 'no-op max', which is the maximum number of "do nothing" actions the agent performs at the start of an episode.

What's this for? And does it apply to some specific episodes or to all episodes? I mean, for some games, if the agent doesn't do anything it will lose immediately, and the episode won't continue to a point where the agent can take meaningful actions.

Thanks

Best Answer

At the start of every game, a random number of no-op actions is played (with the maximum number of such actions controlled by that parameter) to introduce variety into the initial game states.

If an agent starts from exactly the same initial state every time it plays the same game, the concern is that the Reinforcement Learning agent will simply memorize a good sequence of actions from that initial state, rather than learn to observe the current state and select a good action based on that observation (which is what we are actually interested in). By introducing randomness into the state we "start playing from", it becomes more difficult, ideally impossible, for the agent to "cheat" by memorizing a complete sequence of actions from a single, fixed initial state.
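As a rough illustration, here is a minimal sketch of such a random no-op reset, assuming a Gymnasium-style Atari environment in which action 0 is the no-op; the function name `reset_with_noops` and the constants are just illustrative, not from the paper:

```python
import random

NOOP_ACTION = 0   # in the Atari environments, action 0 is "do nothing"
NOOP_MAX = 30     # the 'no-op max' hyperparameter

def reset_with_noops(env, noop_max=NOOP_MAX):
    """Reset the environment, then play a random number of no-op actions
    so that each episode starts from a slightly different state."""
    obs, info = env.reset()
    for _ in range(random.randint(1, noop_max)):
        obs, reward, terminated, truncated, info = env.step(NOOP_ACTION)
        if terminated or truncated:
            # If the game somehow ends during the no-ops, reset and keep going.
            obs, info = env.reset()
    return obs

# Usage (assuming an ALE/Gymnasium Atari environment is available):
# env = gym.make("ALE/Breakout-v5")
# first_obs = reset_with_noops(env)
```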

Note that, in 2017, the paper Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents argued that these sequences of no-op actions are not as effective at the goal described above as we would like. It proposes an alternative that introduces stochasticity throughout the entire game via "sticky actions": with some probability, the environment repeats the most recently executed action rather than the new action selected by the agent.
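For comparison, a minimal sketch of a sticky-actions wrapper, again assuming a Gymnasium-style step/reset interface; the class name is illustrative, and the 0.25 default is the repeat probability commonly used with this protocol:

```python
import random

class StickyActionEnv:
    """Sketch of 'sticky actions': with probability `stickiness`, the
    environment repeats the previously executed action instead of the
    one the agent just selected."""

    def __init__(self, env, stickiness=0.25):
        self.env = env
        self.stickiness = stickiness
        self.last_action = 0  # assume action 0 (no-op) as the initial default

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.stickiness:
            action = self.last_action  # "stick" with the previous action
        self.last_action = action
        return self.env.step(action)
```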
