Solved – the architecture of a stacked convolutional autoencoder

autoencoders, deep-learning, deep-belief-networks, neural-networks

So I am trying to do pretraining on images of humans using convolutional nets. I read the papers (Paper1 and Paper2) and this stackoverflow link, but I am not sure I understand the structure of the nets (it is not well defined in the papers).

Questions:

  • I can have my input followed by a noise layer, a conv layer, and a pooling layer. Thereafter, do I de-pool before producing my output (which is the same as my input image)?

    Say I have several (135,240) images. If I use 32 (12,21) kernels followed by (2,2) pooling, I will end up with 32 (62,110) feature maps. Do I then de-pool to get 32 (124,220) feature maps and flatten them before my (135,240) output layer? (A quick shape check for these numbers is sketched after this list.)

  • If I have multiple such conv-pool layers, should I train them one by one, as in stacked denoising autoencoders? Or can I have something like input-conv-pool-conv-pool-conv-pool-output (the output being the same as the input)? In that case, how are the pooling and de-pooling supposed to be managed? Should I de-pool only after the last pooling layer before the output? And what should the resize factor of that de-pooling be? Is the intention to bring the feature maps back to the shape of the input?

  • Should I be introducing noise layers after every conv-pool-depool block?

  • And then, when fine-tuning, am I supposed to just remove the de-pooling layers and leave the rest the same? Or should I remove both the noise layers and the de-pooling layers?

  • Can anyone point me to a URL/paper that details the architecture of such a stacked convolutional autoencoder for pre-training on images?
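
Here is the quick shape check mentioned above, as a minimal sketch in PyTorch (my choice of framework; the no-padding convolution and nearest-neighbour upsampling standing in for de-pooling are illustrative assumptions):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 135, 240)                # one grayscale (135,240) image
    conv = nn.Conv2d(1, 32, kernel_size=(12, 21))  # 32 kernels of size (12,21), no padding
    pool = nn.MaxPool2d(2)                         # (2,2) max pooling
    unpool = nn.Upsample(scale_factor=2)           # naive de-pooling by repeating values

    h = conv(x)    # -> (1, 32, 124, 220): 135-12+1 = 124, 240-21+1 = 220
    p = pool(h)    # -> (1, 32, 62, 110)
    u = unpool(p)  # -> (1, 32, 124, 220)
    print(h.shape, p.shape, u.shape)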

Best Answer

I am currently exploring stacked convolutional autoencoders myself.

I will try to answer some of your questions to the best of my knowledge. Mind you, I might be wrong, so take it with a grain of salt.

  1. Yes, you have to "reverse" pool and then convolve with a set of filters to recover your output image. A standard network (taking MNIST data as input, i.e. 28x28 input dimensions, with "same"-padded convolutions so the maps keep their size) would be:

        28x28 (input) -- convolve with 5 filters, each 5x5 --> 5 @ 28x28 maps -- max-pool --> 5 @ 14x14 (hidden layer) -- reverse max-pool --> 5 @ 28x28 -- convolve with 5 filters, each 5x5 --> 28x28 (output)
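
     A minimal sketch of this pipeline, assuming PyTorch, "same" padding so the 5x5 convolutions keep the 28x28 size, and max-unpooling with stored indices as the "reverse" pool (the activations and the single-channel output convolution are my own assumptions):

        import torch
        import torch.nn as nn

        class ConvAutoencoder(nn.Module):
            def __init__(self):
                super().__init__()
                # encoder: 1@28x28 -> 5@28x28 ("same" padding) -> 5@14x14
                self.conv = nn.Conv2d(1, 5, kernel_size=5, padding=2)
                self.pool = nn.MaxPool2d(2, return_indices=True)
                # decoder: 5@14x14 -> 5@28x28 (reverse max-pool) -> 1@28x28
                self.unpool = nn.MaxUnpool2d(2)
                self.deconv = nn.Conv2d(5, 1, kernel_size=5, padding=2)

            def forward(self, x):
                h = torch.relu(self.conv(x))
                p, idx = self.pool(h)    # keep the indices for the "reverse" pool
                u = self.unpool(p, idx)  # put values back where the maxima were
                return torch.sigmoid(self.deconv(u))

        model = ConvAutoencoder()
        x = torch.randn(8, 1, 28, 28)  # a batch of fake MNIST-sized images
        print(model(x).shape)          # -> torch.Size([8, 1, 28, 28])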
    
  2. My understanding is that, conventionally, that is what one should do, i.e. train each layer separately. After that you stack the layers and train the entire network once more using the pre-trained weights. However, Yoshua Bengio has some research (the reference escapes my memory) showing that one could construct a fully stacked network and train it from scratch.
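
     As a rough sketch of the greedy layer-wise schedule (reusing the ConvAutoencoder above; the `data` iterable, the second-stage autoencoder `ae2`, and the hyper-parameters are assumptions for illustration):

        import torch
        import torch.nn as nn

        def pretrain(ae, data, epochs=5, lr=1e-3):
            # train one autoencoder to reconstruct its own input
            opt = torch.optim.Adam(ae.parameters(), lr=lr)
            loss_fn = nn.MSELoss()
            for _ in range(epochs):
                for x in data:
                    opt.zero_grad()
                    loss_fn(ae(x), x).backward()
                    opt.step()

        # stage 1: train the first autoencoder on the raw images;
        # stage 2: train the next one on the (detached) codes of the first:
        #   pretrain(ae1, images)
        #   codes = [ae1.pool(torch.relu(ae1.conv(x)))[0].detach() for x in images]
        #   pretrain(ae2, codes)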

  3. My understanding is that the "noise layer" is there to introduce robustness/variability into the input so that training does not overfit.
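
     As a concrete (assumed) example of such corruption, here is a small sketch combining additive Gaussian noise with random zero-masking, two common choices; the network is then trained to reconstruct the clean input:

        import torch

        def corrupt(x, noise_std=0.3, drop_p=0.2):
            # additive Gaussian noise plus random zero-masking of pixels
            noisy = x + noise_std * torch.randn_like(x)
            keep = (torch.rand_like(x) > drop_p).float()
            return noisy * keep

        # denoising objective: reconstruct the *clean* x from the corrupted input
        #   loss = loss_fn(model(corrupt(x)), x)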

  4. As long as you are still "training", whether pre-training or fine-tuning, I think the reconstruction part (i.e. reverse pooling, de-convolution, etc.) is necessary. Otherwise, how would one back-propagate the reconstruction error to tune the weights?

  5. I have tried browsing through numerous papers, but the architecture is never explained in full. If you find any, please do let me know.