Which approach is better for feature learning: deep autoencoders or stacked autoencoders?

Tags: autoencoders, deep learning, neural networks

To extract useful features from raw data in a purely unsupervised setting, we could use autoencoders. To me, there seem to be two approaches to training.

  1. We could train layer by layer and get a stacked autoencoder. Say the raw data has 100 variables; the first autoencoder is trained with 50 hidden units. The resulting 50 activations are then used as input to a second autoencoder with, say, 10 hidden units. We now have a stacked autoencoder, and the 10 activations can be regarded as 10 new features for further use.

  2. Alternatively, we could use a single deep autoencoder with the architecture 100-50-10-50-100, where 50, 10, and 50 are the numbers of units in the three hidden layers, trained end to end. Again, we could extract the 10 activations in the middle as new features. (A code sketch contrasting the two approaches follows this list.)
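For concreteness, here is a minimal PyTorch sketch of both approaches on random data standing in for the 100-variable input. The layer sizes (100-50-10) follow the question; the ReLU activations, MSE reconstruction loss, Adam optimizer, and all other hyperparameters are illustrative assumptions, not anything prescribed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(1024, 100)  # placeholder for the raw 100-variable data

def train(model, inputs, epochs=50, lr=1e-3):
    """Minimize the reconstruction MSE of `model` on `inputs` (full batch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(inputs), inputs)
        loss.backward()
        opt.step()
    return model

# --- Approach 1: stacked autoencoder, trained greedily layer by layer ---
ae1 = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 100))
train(ae1, X)                                   # first autoencoder: 100 -> 50 -> 100
with torch.no_grad():
    H1 = ae1[1](ae1[0](X))                      # 50 activations from the first encoder

ae2 = nn.Sequential(nn.Linear(50, 10), nn.ReLU(), nn.Linear(10, 50))
train(ae2, H1)                                  # second autoencoder: 50 -> 10 -> 50
with torch.no_grad():
    stacked_features = ae2[1](ae2[0](H1))       # the 10 new features

# --- Approach 2: deep autoencoder 100-50-10-50-100, trained end to end ---
encoder = nn.Sequential(nn.Linear(100, 50), nn.ReLU(),
                        nn.Linear(50, 10), nn.ReLU())
decoder = nn.Sequential(nn.Linear(10, 50), nn.ReLU(),
                        nn.Linear(50, 100))
deep_ae = nn.Sequential(encoder, decoder)
train(deep_ae, X)
with torch.no_grad():
    deep_features = encoder(X)                  # the 10 middle-layer activations

print(stacked_features.shape, deep_features.shape)  # both torch.Size([1024, 10])
```

In approach 1 each autoencoder only ever sees the (fixed) outputs of the previous one, whereas in approach 2 all the weights are updated jointly against the original reconstruction target.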

I am wondering what the essential difference between these two approaches is. Which one is better at learning features? Thanks in advance for any help.

Best Answer

Unsupervised, layer-wise pretraining was one of the early innovations that made it possible to use deep networks in practice. Since then, other tricks have been discovered that make layer-wise pretraining unnecessary in many cases. Rectified linear units (ReLUs) are one example.

Glorot et al. (2011). Deep Sparse Rectifier Neural Networks.

Using deep autoencoders with ReLUs, they found that unsupervised pretraining was unnecessary and, in some cases, performance was better without it. They did find, however, that unsupervised pretraining can help in a semi-supervised setting, when unlabeled data is available.

Optimization methods are another class of tricks. For example, Hessian-free (HF) optimization uses second-order information to compute the update directions. The following paper found that HF optimization made unsupervised pretraining unnecessary for training deep autoencoders.

Martens (2010). Deep learning via Hessian-free optimization.
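As a side note on what "second-order information" means here: HF methods never form the Hessian explicitly; they only need curvature-vector products, which automatic differentiation can supply via a double backward pass. A minimal PyTorch illustration on a toy quadratic (the example is mine, not from the paper):

```python
import torch

# Toy quadratic loss L(w) = 0.5 * w^T A w, whose Hessian is exactly A,
# so the Hessian-vector product can be checked against A @ v.
A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])
w = torch.randn(2, requires_grad=True)
v = torch.randn(2)

loss = 0.5 * w @ A @ w
grad = torch.autograd.grad(loss, w, create_graph=True)[0]  # gradient, kept in the graph
hvp = torch.autograd.grad(grad @ v, w)[0]                  # Hessian-vector product H v

print(torch.allclose(hvp, A @ v))  # True: matches A @ v without ever forming H
```

In HF optimization, products like this are fed into a conjugate-gradient solver to obtain the update direction, so the full Hessian (or Gauss-Newton matrix) never has to be materialized.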

My impression is that unsupervised, layer-wise pretraining has generally fallen out of favor, except in specific circumstances (e.g. the semi-supervised case). For related discussion, see here.
