Solved – CNN and kernel sizes: is upsampling useful?

convolution, deep learning, reinforcement learning

I am playing with a Deep Recurrent Q-Network in reinforcement learning.
The architecture I am currently using is similar to the one presented in "Human-level control through deep reinforcement learning" (Volodymyr Mnih, Koray Kavukcuoglu, David Silver et al.), Nature, volume 518, pages 529–533 (26 February 2015):

The input to the neural network consists of an 84×84×3 image produced by the preprocessing map.

The first hidden layer convolves 32 filters of 8×8 with stride 4 with the input image and applies a rectifier nonlinearity.

The second hidden layer convolves 64 filters of 4×4 with stride 2, again followed by a rectifier nonlinearity.

This is followed by a third convolutional layer that convolves 64 filters of 3×3 with stride 1, and then a fourth convolutional layer with 512 filters of 7×7 with stride 1.
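
To make the shape arithmetic concrete, here is roughly what that stack looks like as a PyTorch sketch (my assumption: unpadded "valid" convolutions, as in the Nature paper; per-layer output shapes are in the comments):

    import torch
    import torch.nn as nn

    # Sketch of the conv stack described above, assuming no padding (NCHW layout).
    convs = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=8, stride=4),   # (84-8)/4+1 = 20 -> 20x20x32
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2),  # (20-4)/2+1 = 9  -> 9x9x64
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),  # (9-3)/1+1 = 7   -> 7x7x64
        nn.ReLU(),
        nn.Conv2d(64, 512, kernel_size=7, stride=1), # (7-7)/1+1 = 1   -> 1x1x512
        nn.ReLU(),
    )

    x = torch.zeros(1, 3, 84, 84)  # dummy batch
    print(convs(x).shape)          # torch.Size([1, 512, 1, 1])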

This is the convolutional part of the network, and it works! However, I think it is largely oversized for my problem: a simple grid game whose state is an 11×11 RGB grid (11×11×3). In fact, the preprocessing function upsamples the grid in order to match the input shape of the model. What's the point of resizing the grid from 11×11 to 84×84 (and thus having to manage a far larger set of weights)?

However, every simpler architecture I have tried by hand fails: there is no learning at all!
For example, I tried the following convolutional module (input shape: 11×11×3; a code sketch follows the list):

  1. 32 filters, 4×4, stride 2
  2. 64 filters, 2×2, stride 1
  3. 512 filters, 3×3, stride 1
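
In code, the module I tried looks roughly like this (again a PyTorch sketch, assuming no padding; my actual implementation may differ in the details):

    import torch
    import torch.nn as nn

    # Sketch of the simpler module I tried on the 11x11x3 grid (no padding).
    convs = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=4, stride=2),   # (11-4)//2+1 = 4 -> 4x4x32
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=2, stride=1),  # (4-2)/1+1 = 3   -> 3x3x64
        nn.ReLU(),
        nn.Conv2d(64, 512, kernel_size=3, stride=1), # (3-3)/1+1 = 1   -> 1x1x512
        nn.ReLU(),
    )

    print(convs(torch.zeros(1, 3, 11, 11)).shape)  # torch.Size([1, 512, 1, 1])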

I've read many similar questions, but wasn't able to find a hint about this kind of problem (I'm a beginner, so I have no experience with hyperparameters!). Could you offer any insights?

Best Answer

There are several potential problems I can see. First off, you can't cleanly apply a 4×4 filter with stride 2 to an 11×11 input: (11 − 4)/2 = 3.5 is not an integer, so the filter positions don't tile the grid and the last row and column are never seen by the network. Most frameworks silently floor the output size rather than raise an error, which may be causing the network to fail without any warning.
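
You can check this with the standard output-size formula for an unpadded ("valid") convolution, floor((W − F)/S) + 1, where W is the input width, F the filter size, and S the stride:

    # Output width of a "valid" (unpadded) convolution
    def out_size(w, f, s):
        return (w - f) // s + 1

    # Rightmost input column touched by the last filter position
    def last_covered(w, f, s):
        return (out_size(w, f, s) - 1) * s + f

    print(out_size(11, 4, 2))      # 4  -> a 4x4 feature map
    print(last_covered(11, 4, 2))  # 10 -> row/column 11 is never seen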

The reason why is easier to see in a picture than to explain in words, so have a look at the Stanford CS231n notes on convolutional networks for a basic overview. (If you're a beginner, it would actually be helpful to read all of the CS231n course notes.) If you scroll down a bit, there is an animated diagram that will help you visualize why a stride of 2 won't work here.

Also, after briefly looking through the code, I noticed several places where the input dimensions (84×84) are hardcoded. There's a possibility that you've missed adjusting the input dimensions somewhere.

My advice would be to first scan through all the files and make sure you've adjusted every input from 84×84 to 11×11. I would also just use 3×3 filters with stride 1 for all layers: your input space is so small that you don't really need any downsampling (which is the main reason to use larger strides). A sketch of such a module is below. If it still doesn't work, you're going to need to sit down and fully understand what the code is doing at every step in order to debug it.
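
As a rough starting point, something along these lines (a PyTorch sketch; the layer widths are arbitrary and only for illustration):

    import torch
    import torch.nn as nn

    # All 3x3 filters, stride 1, no padding: no downsampling beyond the
    # natural shrinkage, so every cell of the 11x11 grid is covered.
    convs = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, stride=1),   # 11 -> 9
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=1),  # 9  -> 7
        nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1),  # 7  -> 5
        nn.ReLU(),
    )

    x = torch.zeros(1, 3, 11, 11)
    print(convs(x).shape)  # torch.Size([1, 64, 5, 5]); flatten and feed the recurrent/FC head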
