Solved – DQN – How to feed the input of 4 still frames from a game as one single state input

neural-networks, q-learning, reinforcement-learning

I was reading this blog about Deep Q-Learning.

1- In the "The input" section of the blog, I wanted to know how we feed the 4 still frames/screenshots from the game, which represent the input state, into the policy network. Are all 4 frames fed in as one flattened tensor (where one image ends and the next one starts, forming one continuous row)? Or are they fed separately, one after the other, into the network?

2- For preprocessing the images, do we avoid using the max-pooling stage? My understanding is that max-pooling eliminates the need for spatial/position recognition in image-feature detection. In a normal conv-net this is important for recognising image features regardless of where they appear and at what distance (so we use max-pooling). In Q-learning for games, however, the spatial position of the different elements in the image is important. Therefore, we remove max-pooling from the preprocessing stage. Is this correct?

3- Can anyone recommend a good implementation resource for Deep Q-learning, written from scratch in Python, i.e. without out-of-the-box libraries like PyTorch, Keras, Scikit-learn, etc., for a game where image frames from the game are required as the state input? I'm thinking that implementing the model from scratch gives better control over customisation and fine-tuning of the hyper-parameters. Or is it better to use an out-of-the-box library? Any code implementation would be super helpful.

Many thanks in advance.

Best Answer

Or are they fed separately, one after the other, into the network?

The 4-frame "stack" is fed in as a single tensor, not one frame at a time: the frames are stacked along a third dimension, so the tensor has shape (frame height, frame width, number of frames). It's not clear from your post, but the blog is probably referencing "Playing Atari with Deep Reinforcement Learning" by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, which explains these choices and provides citations for further reading.
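As a concrete illustration, here is a minimal sketch of the stacking step, assuming 84x84 grayscale frames as in the Atari paper; the array contents are placeholders and only the shapes matter:

```python
import numpy as np

# Hypothetical preprocessed frames: the 4 most recent grayscale
# 84x84 screenshots from the emulator.
frames = [np.zeros((84, 84), dtype=np.float32) for _ in range(4)]

# Stack along a new last axis: one tensor of shape (84, 84, 4).
# The network sees all four frames at once, as channels of a
# single input. (A channels-first framework like PyTorch would
# want shape (4, 84, 84) instead; the stacking idea is the same.)
state = np.stack(frames, axis=-1)
print(state.shape)  # (84, 84, 4)
```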

do we avoid using the max-pooling stage?

This depends on the specific model; reading the primary literature about a model should make clear how it works. For the DQN in the Mnih et al. paper specifically, there is no pooling at all: downsampling is done with strided convolutions, which is consistent with your intuition that the positions of objects on screen matter. (Note also that pooling, where used, is part of the network itself, not the image preprocessing; preprocessing in the paper is just grayscale conversion, downscaling, and cropping.)
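If it helps to see this concretely, here is a rough PyTorch sketch of the 2013 architecture. The layer sizes are taken from the paper, but treat the code as illustrative rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network roughly following Mnih et al. (2013).
    Downsampling is done entirely with strided convolutions; there
    is no max-pooling layer anywhere in the network."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),  # input: 4 stacked frames
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),  # 84x84 input -> 9x9 feature maps
            nn.ReLU(),
            nn.Linear(256, n_actions),   # one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, 4, 84, 84): a channels-first frame stack
        return self.head(self.conv(x))
```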

Can anyone recommend a good implementation resource for Deep Q-learning, written from scratch in Python, i.e. without out-of-the-box libraries like PyTorch, Keras, Scikit-learn, etc., for a game where image frames from the game are required as the state input?

There are lightweight implementations of many reinforcement learning algorithms. One can be found at https://github.com/PacktPublishing/Deep-Reinforcement-Learning-Hands-On, the code companion to Maxim Lapan's Deep Reinforcement Learning Hands-On.

However, this repository still uses PyTorch for all of the generic neural-network pieces. The reason is that there is simply no point in reinventing the wheel for network construction, autograd, backprop, and all the other standard neural-network operations. The reinforcement-learning portions of the code are very clearly written, although there is the occasional bug.
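To give a sense of how small the RL-specific part is once a library handles the network plumbing, here is a minimal sketch of a single DQN update step in PyTorch. The function signature and the `batch` layout are my own assumptions, and it assumes a separate target network as in the Nature version of the paper:

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.99):
    """One Q-learning update on a sampled minibatch.
    `batch` is a hypothetical (states, actions, rewards, next_states,
    dones) tuple of tensors; the two nets are copies of the same
    Q-network, with target_net updated only periodically."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions actually taken
    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'),
    # with the bootstrap term zeroed out at episode ends
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q, target)  # Huber loss, common for DQN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```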
