Solved – How to initialize and update Q network weights in DQN with delayed/sparse reward

neural-networks, reinforcement-learning

I cannot wrap my head around one thing about delayed/sparse rewards in reinforcement learning. In my problem, I would like to teach a neural network to play a simple two-player game; in other words, I would like to use a neural network as the Q function in reinforcement learning.

This is nothing new, since DeepMind and others have already published impressive results even for much more difficult games.

Now here's the specific question:
The only reward I supply to my agent is a win/loss reward at the end of the game. During the game, no reward is given. What values should then be provided to the Q-value neural network?

Let's take Nim as an example (https://en.wikipedia.org/wiki/Nim): a game where we begin with N coins and players alternate taking 1 to 3 coins. The player who takes the last coin loses. If we started with 1000 coins, it would take a very long time before the win (+1) or loss (-1) reward is given to the network.
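
For concreteness, here is a minimal sketch of that setting in Python. The class and method names are my own, not from any library, and legality checks are omitted: the state is the number of coins left, an action removes 1 to 3 coins, and the only non-zero reward is the terminal one.

```python
# Minimal Nim sketch (my own names): state = coins left, action = coins taken,
# reward is 0 everywhere except the terminal transition, where the player who
# took the last coin gets -1 (they lose).
class NimEnv:
    def __init__(self, n_coins=1000):
        self.n_coins = n_coins

    def reset(self):
        self.coins = self.n_coins
        return self.coins

    def step(self, action):
        """`action` is the number of coins the current player takes (1, 2 or 3)."""
        assert action in (1, 2, 3)
        self.coins -= action
        if self.coins <= 0:
            # Terminal transition: the mover took the last coin and loses.
            return 0, -1.0, True
        # Ordinary move: no reward, as described in the question.
        return self.coins, 0.0, False
```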

Every update batch sampled from the experience replay buffer and fed to the network would then consist mainly of transitions with a reward of 0.

How does one initialize such a network (meaning the weights, biases, and learning rate) so that it can learn the optimal, or at least a "correct", policy?

I am taking the algorithm from DeepMind's Mnih et al. (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), where the targets for the network are $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$ when the next state is non-terminal, and $y_t = r_t$ otherwise.

The output of the network would be a vector of size 3, one value for each of the actions of taking 1, 2, or 3 coins. Only the target for the action that was selected during exploration/exploitation would be updated according to the formula above. The targets for the other actions would remain the Q-values predicted for the state in which the action was selected.

For non-terminal states, $r_t = 0$ in my case. Each batch on which I would train the network would then consist mostly of such cases and only a few terminal ones. The network is then heavily updated with bootstrap values produced by the as-yet-untrained network itself, and the actual training signal is really sparse.
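
For what it's worth, here is how I picture the target construction described above as code. This is my own sketch, assuming a single network (no separate target network), NumPy arrays, and a hypothetical `q_net` function mapping a batch of states to an array of Q-values of shape (batch, 3); actions are encoded as indices 0-2 for taking 1-3 coins.

```python
import numpy as np

def dqn_targets(q_net, states, actions, rewards, next_states, terminal, gamma=0.99):
    """Sketch of per-sample DQN targets for a sampled batch."""
    targets = q_net(states).copy()              # current predictions, shape (B, 3)
    next_max = q_net(next_states).max(axis=1)   # max_{a'} Q(s_{t+1}, a')
    # y_t = r_t + gamma * max_a' Q(s_{t+1}, a') for non-terminal transitions,
    # y_t = r_t for terminal ones (terminal is a 0/1 float array).
    y = rewards + gamma * next_max * (1.0 - terminal)
    # Only the entry of the taken action is replaced; the other two entries keep
    # the network's own predictions, so their error (and gradient) is zero.
    targets[np.arange(len(actions)), actions] = y
    return targets
```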

How does one perform any learning in such a case? Is it even possible with DQN?

Best Answer

The only reward I supply to my agent is a win/loss reward at the end of the game. During the game, no reward is given. What values should then be provided to the Q-value neural network?

Typically you feed in the state, and the output is a $Q$-value for every action that can be taken from that state. It doesn't matter that a reward isn't given at every step.
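
As a rough illustration (the layer sizes and the scalar state encoding are my own choices, not something prescribed here), such a network for the Nim example might look like this in PyTorch: the state goes in, one Q-value per action comes out.

```python
import torch
import torch.nn as nn

# Sketch only: architecture and state encoding are arbitrary illustrative choices.
q_net = nn.Sequential(
    nn.Linear(1, 64),   # input: e.g. the (normalised) number of coins left
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 3),   # output: Q(s, take 1), Q(s, take 2), Q(s, take 3)
)

state = torch.tensor([[200.0 / 1000.0]])  # 200 coins left, scaled to [0, 1]
q_values = q_net(state)                   # shape (1, 3), one Q-value per action
```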

Every update batch sampled from the experience replay buffer and fed to the network would then consist mainly of transitions with a reward of 0.

That is true, but that doesn't prevent the Q-values from "backflowing" from the terminal states to the rest of the states. Consider a game with one action in which you start in state A, then go to B and C, and get a reward of 1 at D. After one game, your Q-value for D is 1; after another update, your Q-value at C becomes 1 as well, and so on.
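
A tiny tabular example (my own toy script, using $\gamma = 1$ and learning rate 1 for clarity) shows this backflow directly:

```python
# Chain A -> B -> C -> D with a single action; reward 1 only on the final transition.
gamma, alpha = 1.0, 1.0
Q = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0}
# (state, reward, next_state); next_state None marks the terminal transition.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 0.0, "D"), ("D", 1.0, None)]

for game in range(1, 5):
    for s, r, s_next in episode:
        target = r if s_next is None else r + gamma * Q[s_next]
        Q[s] += alpha * (target - Q[s])
    print(game, Q)
# Game 1: only Q[D] becomes 1.  Game 2: Q[C] follows.  By game 4 the value has
# flowed all the way back to A, even though most transitions had reward 0.
```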

How does one initialize such a network (meaning the weights, biases, and learning rate) so that it can learn the optimal, or at least a "correct", policy?

You should pick any reasonable weight initialization scheme; hyperparameter tuning must be done as usual.
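
For example (this is just one reasonable choice among many, not something specific to DQN), He/Kaiming initialisation for ReLU layers with zero biases is a common starting point:

```python
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 3))

def init_weights(module):
    # He/Kaiming initialisation is a common default for ReLU networks; the
    # choice here is illustrative, not prescribed by the answer.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

q_net.apply(init_weights)
# The learning rate is a hyperparameter like any other; Adam with something
# around 1e-4 is a frequent starting point for DQN-style training.
```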

Only the target for the action that was selected during exploration/exploitation would be updated according to the formula above. The targets for the other actions would remain the Q-values predicted for the state in which the action was selected.

When you update the network parameters $\theta$, this necessarily changes all the $Q$-values for a given state, not just the $Q$-value of the selected action.
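
A quick toy demonstration of this point (my own construction, not from the answer): a gradient step on the error for one action's target still shifts the network's other outputs, because all actions share the hidden layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 3))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

s = torch.tensor([[0.5]])
print("before:", net(s).detach())

loss = (net(s)[0, 1] - 1.0) ** 2   # error on action 1's Q-value only
opt.zero_grad()
loss.backward()
opt.step()

print("after: ", net(s).detach())  # the other actions' Q-values typically move too
```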

How does one perform any learning in such a case? Is it even possible with DQN?

It is possible, although the sparser the rewards, the more difficult the problem becomes. I remember that when replicating some DQN results on the Atari game Enduro, it took 1-2 million steps before the network started getting more than 0 reward.
