Ordinary autoencoder architectures (as opposed to variational autoencoders, stacked denoising autoencoders, etc.) seem to have only three layers: the input, the hidden/code layer, and the output/reconstruction. Are there any examples of papers that use architectures with multiple hidden layers? If not, what is the theoretical justification for using only one hidden layer in an autoencoder?
Solved – Architecture of autoencoders
autoencoders
Best Answer
Yes: look for "deep autoencoders", a.k.a. "stacked autoencoders", such as {1}.
Hugo Larochelle has a video on it: Neural networks [7.6] : Deep learning - deep autoencoder
Geoffrey Hinton also has a video on it: Lecture 15.2 — Deep autoencoders [Neural Networks for Machine Learning]
Examples of deep autoencoders that don't use pretraining: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
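To make "multiple hidden layers" concrete, here is a minimal numpy sketch (not taken from any of the cited sources; the layer sizes, learning rate, and toy data are all illustrative) of a deep autoencoder with an 8 → 4 → 2 → 4 → 8 architecture, trained end-to-end with plain backpropagation and no pretraining:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative layer sizes: 8 -> 4 -> 2 (code) -> 4 -> 8.
sizes = [8, 4, 2, 4, 8]
Ws = [rng.normal(0.0, 0.5, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

X = rng.random((64, 8))  # toy data in [0, 1)

def forward(X):
    acts = [X]
    for W, b in zip(Ws, bs):
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts  # acts[2] is the 2-d code, acts[-1] the reconstruction

def mse(X):
    return float(np.mean((forward(X)[-1] - X) ** 2))

loss_before = mse(X)
lr = 0.5
for _ in range(500):
    acts = forward(X)
    # Backprop squared error through all layers jointly (no pretraining).
    delta = (acts[-1] - X) * acts[-1] * (1 - acts[-1])  # sigmoid'(a) = a(1-a)
    for i in range(len(Ws) - 1, -1, -1):
        grad_W = acts[i].T @ delta / len(X)
        grad_b = delta.mean(axis=0)
        if i > 0:  # propagate the error before updating this layer's weights
            delta = (delta @ Ws[i].T) * acts[i] * (1 - acts[i])
        Ws[i] -= lr * grad_W
        bs[i] -= lr * grad_b
loss_after = mse(X)
```

The only difference from the three-layer case is that the encoder and decoder each stack several nonlinear layers; the gradients still flow through the whole network at once.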
E.g., {2} uses a stacked autoencoder with greedy layer-wise training.
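Greedy layer-wise training, by contrast, trains one shallow autoencoder at a time and feeds its codes to the next. A rough sketch of that procedure, again in plain numpy with illustrative sizes (not the exact setup of {2}):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_shallow_ae(X, code_dim, epochs=300, lr=0.5):
    """Train a one-hidden-layer autoencoder X -> code -> X in isolation;
    return the encoder parameters and the codes for X."""
    n_in = X.shape[1]
    W1 = rng.normal(0.0, 0.5, (n_in, code_dim)); b1 = np.zeros(code_dim)
    W2 = rng.normal(0.0, 0.5, (code_dim, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)          # codes
        Y = sigmoid(H @ W2 + b2)          # reconstruction
        d2 = (Y - X) * Y * (1 - Y)        # output-layer error
        d1 = (d2 @ W2.T) * H * (1 - H)    # hidden-layer error
        W2 -= lr * H.T @ d2 / len(X); b2 -= lr * d2.mean(axis=0)
        W1 -= lr * X.T @ d1 / len(X); b1 -= lr * d1.mean(axis=0)
    return (W1, b1), sigmoid(X @ W1 + b1)

X = rng.random((64, 8))  # toy data
layers = []
inp = X
for code_dim in (4, 2):  # greedy: one layer at a time, each on the previous codes
    params, inp = train_shallow_ae(inp, code_dim)
    layers.append(params)
# `layers` now holds a pretrained 8 -> 4 -> 2 encoder stack, which can then
# be unrolled into a deep autoencoder and fine-tuned end-to-end.
```

Each shallow autoencoder sees only a single-hidden-layer problem, which is what made this trick useful before end-to-end training of deep networks became reliable.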
Note that autoencoders can also be built from architectures fancier than fully connected feedforward networks, e.g. {3}.
References: