Solved – the architecture of a stacked convolutional autoencoder

autoencoders, deep-learning, deep-belief-networks, neural-networks

So I am trying to do pretraining on images of humans using convolutional nets. I read the papers (Paper1 and Paper2) and this stackoverflow link, but I am not sure I understand the structure of the nets (it is not well defined in the papers).

Questions:

  • I can have my input followed by a noise layer, a conv layer, and a pooling layer. Thereafter, do I de-pool before producing my output (which is the same as my input image)?

    Say I have several (135,240) images. If I use 32 (12,21) kernels followed by (2,2) pooling, I will end up with 32 (62,110) feature maps. Do I then de-pool to get 32 (124,220) feature maps and flatten them before my (135,240) output layer? (A quick shape check for these numbers is sketched after this list.)

  • If I have multiple such conv-pool layers, should I train them one by one, as in stacked denoising autoencoders? Or can I have something like input-conv-pool-conv-pool-conv-pool-output (the output being the same as the input)? In that case, how are the pooling and de-pooling supposed to be managed? Should I de-pool only after the last pooling layer before the output? And what should the resize factor of that de-pooling be? Is the intention to bring the feature maps back to the shape of the input?

  • Should I be introducing noise layers after every conv-pool-depool block?

  • And then, when fine-tuning, am I supposed to just remove the de-pooling layers and leave the rest the same? Or should I remove both the noise layers and the de-pooling layers?

  • Can anyone point me to a URL/paper that details the architecture of such a stacked convolutional autoencoder for pre-training on images?
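
Here is the quick shape check mentioned above, as a minimal sketch in PyTorch (my choice of framework; the no-padding convolution and nearest-neighbour upsampling standing in for de-pooling are illustrative assumptions):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 135, 240)                # one grayscale (135,240) image
    conv = nn.Conv2d(1, 32, kernel_size=(12, 21))  # 32 kernels of size (12,21), no padding
    pool = nn.MaxPool2d(2)                         # (2,2) max pooling
    unpool = nn.Upsample(scale_factor=2)           # naive de-pooling by repeating values

    h = conv(x)    # -> (1, 32, 124, 220): 135-12+1 = 124, 240-21+1 = 220
    p = pool(h)    # -> (1, 32, 62, 110)
    u = unpool(p)  # -> (1, 32, 124, 220)
    print(h.shape, p.shape, u.shape)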

Best Answer

I am currently exploring stacked convolutional autoencoders myself.

I will try to answer some of your questions to the best of my knowledge. Mind you, I might be wrong, so take it with a grain of salt.

  1. Yes, you have to "reverse" pool and then convolve with a set of filters to recover your output image. A standard network (taking MNIST data as input, i.e. 28x28 input dimensions, with "same"-padded convolutions so the maps keep their size) would be:

        28x28 (input) -- convolve with 5 filters, each 5x5 --> 5 @ 28x28 maps -- max-pool --> 5 @ 14x14 (hidden layer) -- reverse max-pool --> 5 @ 28x28 -- convolve with 5 filters, each 5x5 --> 28x28 (output)
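
     A minimal sketch of this pipeline, assuming PyTorch, "same" padding so the 5x5 convolutions keep the 28x28 size, and max-unpooling with stored indices as the "reverse" pool (the activations and the single-channel output convolution are my own assumptions):

        import torch
        import torch.nn as nn

        class ConvAutoencoder(nn.Module):
            def __init__(self):
                super().__init__()
                # encoder: 1@28x28 -> 5@28x28 ("same" padding) -> 5@14x14
                self.conv = nn.Conv2d(1, 5, kernel_size=5, padding=2)
                self.pool = nn.MaxPool2d(2, return_indices=True)
                # decoder: 5@14x14 -> 5@28x28 (reverse max-pool) -> 1@28x28
                self.unpool = nn.MaxUnpool2d(2)
                self.deconv = nn.Conv2d(5, 1, kernel_size=5, padding=2)

            def forward(self, x):
                h = torch.relu(self.conv(x))
                p, idx = self.pool(h)    # keep the indices for the "reverse" pool
                u = self.unpool(p, idx)  # put values back where the maxima were
                return torch.sigmoid(self.deconv(u))

        model = ConvAutoencoder()
        x = torch.randn(8, 1, 28, 28)  # a batch of fake MNIST-sized images
        print(model(x).shape)          # -> torch.Size([8, 1, 28, 28])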
    
  2. My understanding is that, conventionally, that is what one should do, i.e. train each layer separately. After that you stack the layers and train the entire network once more using the pre-trained weights. However, Yoshua Bengio has some research (the reference escapes my memory) showing that one could construct a fully stacked network and train it from scratch.
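
     As a rough sketch of the greedy layer-wise schedule (reusing the ConvAutoencoder above; the `data` iterable, the second-stage autoencoder `ae2`, and the hyper-parameters are assumptions for illustration):

        import torch
        import torch.nn as nn

        def pretrain(ae, data, epochs=5, lr=1e-3):
            # train one autoencoder to reconstruct its own input
            opt = torch.optim.Adam(ae.parameters(), lr=lr)
            loss_fn = nn.MSELoss()
            for _ in range(epochs):
                for x in data:
                    opt.zero_grad()
                    loss_fn(ae(x), x).backward()
                    opt.step()

        # stage 1: train the first autoencoder on the raw images;
        # stage 2: train the next one on the (detached) codes of the first:
        #   pretrain(ae1, images)
        #   codes = [ae1.pool(torch.relu(ae1.conv(x)))[0].detach() for x in images]
        #   pretrain(ae2, codes)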

  3. My understanding is that the "noise layer" is there to introduce robustness/variability into the input so that training does not overfit.
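
     As a concrete (assumed) example of such corruption, here is a small sketch combining additive Gaussian noise with random zero-masking, two common choices; the network is then trained to reconstruct the clean input:

        import torch

        def corrupt(x, noise_std=0.3, drop_p=0.2):
            # additive Gaussian noise plus random zero-masking of pixels
            noisy = x + noise_std * torch.randn_like(x)
            keep = (torch.rand_like(x) > drop_p).float()
            return noisy * keep

        # denoising objective: reconstruct the *clean* x from the corrupted input
        #   loss = loss_fn(model(corrupt(x)), x)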

  4. As long as you are still "training", whether pre-training or fine-tuning, I think the reconstruction part (i.e. reverse pooling, de-convolution, etc.) is necessary. Otherwise, how would one back-propagate the reconstruction error to tune the weights?

  5. I have tried browsing through numerous papers, but the architecture is never explained in full. If you find any, please do let me know.