Solved – Why does Q-Learning use epsilon-greedy during testing

machine-learning, q-learning, reinforcement-learning

In DeepMind's paper on Deep Q-Learning for Atari video games (here), they use an epsilon-greedy policy for exploration during training: each action is either the one with the highest q-value or a uniformly random action. The choice between the two is random and governed by the value of epsilon, which is annealed during training so that, initially, many random actions are taken (exploration), while as training progresses the maximum-q-value action is chosen more and more often (exploitation).

Then, during testing, they also use this epsilon-greedy policy, but with epsilon at a very low value, so there is a strong bias towards exploitation over exploration: the action with the highest q-value is almost always preferred over a random action. However, random actions are still sometimes chosen (5% of the time).
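For concreteness, here is a minimal sketch of that action-selection rule in Python. The function names and the linear annealing schedule (from 1.0 down to 0.1, a commonly quoted setting for the DQN paper) are illustrative assumptions, not the authors' actual code:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action, otherwise the max-q action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def annealed_epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linear annealing used during training (the exact values here are assumptions)."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Training: epsilon shrinks as training progresses.
#   a_train = epsilon_greedy(q_values, annealed_epsilon(step))
# Testing: epsilon is held at a small fixed value.
#   a_test = epsilon_greedy(q_values, 0.05)
```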

My questions are:

  1. Why is any exploration necessary at all at this point, given that training has already been done?

  2. If the system has learned the optimal policy, then why can't the action always be chosen as the one with the highest q-value?

  3. Shouldn't exploration be done only in training, and then once the optimal policy is learned, the agent can just repeatedly choose the optimal action?

Best Answer

In the Nature paper they mention:

The trained agents were evaluated by playing each game 30 times for up to 5 min each time with different initial random conditions (‘noop’; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation.

I think what they mean is 'to nullify the negative effects of over/underfitting'. Using an epsilon of 0 is a fully exploitative choice (as you point out) and amounts to a strong assumption: that the learned q-values, and hence the greedy policy, are correct in every state.

For instance, consider a labyrinth game where the agent's current Q-estimates have converged to the optimal policy except in one grid cell, where it greedily chooses to move toward a boundary and therefore remains in the same cell. If the agent reaches such a state while always choosing the max-Q action, it will be stuck there forever. Keeping a slightly explorative, stochastic element in its policy (a tiny amount of epsilon) allows it to escape such states.
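As a toy illustration of that failure mode (the state names, Q-values, and transitions below are made up, not taken from the paper), a purely greedy agent loops forever in the stuck state, while even a small epsilon eventually escapes:

```python
import random

# Hypothetical stuck state: the (wrongly) highest-valued action "up" hits a wall
# and leaves the agent where it is; "right" is the move that actually escapes.
Q = {"s_stuck": {"up": 1.0, "right": 0.5}}
transition = {("s_stuck", "up"): "s_stuck",
              ("s_stuck", "right"): "s_free"}

def act(state, epsilon):
    if random.random() < epsilon:
        return random.choice(list(Q[state]))   # exploration
    return max(Q[state], key=Q[state].get)     # exploitation (max-Q action)

state, steps = "s_stuck", 0
while state == "s_stuck":
    state = transition[(state, act(state, epsilon=0.05))]
    steps += 1
print(f"escaped after {steps} steps")  # with epsilon = 0 this loop never terminates
```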

Having said that, in the code implementations I have looked at (and written myself), performance is often measured in practice with a purely greedy policy, for exactly the reasons you list in your question.
