Solved – any difference between training a stacked autoencoder and a 2-layer neural network

autoencoders, deep learning, neural networks

Let's say I am writing an algorithm for building a 2-layer stacked autoencoder and a 2-layer neural network. Are they the same thing, or different?

What I understand is that when I build a stacked autoencoder, I build it layer by layer. For a neural network, I initialize all the parameters in the network, and then for each data point I pass it through the network, calculate the loss (e.g., Euclidean distance), and do backpropagation.

Best Answer

Stacked autoencoders and multi-layer neural networks are different. In practice, though, you'll have the two networks share weights and possibly share memory buffers, so in your implementation the two networks become entwined.

Typically, autoencoders are trained in an unsupervised, greedy, layer-wise fashion. (No labels; begin training with just the first layer of the network and then add new layers as you go.) The weights can be learned using a variety of techniques ranging from "batch" gradient descent (please don't do that), to mini-batch stochastic gradient descent (SGD), to quasi-Newton methods like L-BFGS.
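
As a rough illustration, here is a minimal PyTorch sketch of greedy layer-wise pretraining with mini-batch SGD (the Adam variant). The layer sizes, batch size, epoch count, and synthetic data are all placeholder assumptions, not anything from a particular implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for mini-batches of unlabeled data (e.g., flattened 28x28 images).
data = [torch.randn(32, 784) for _ in range(50)]

sizes = [784, 256, 64]  # input -> hidden 1 -> hidden 2 (illustrative sizes)
encoders = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(2)]
decoders = [nn.Linear(sizes[i + 1], sizes[i]) for i in range(2)]

def pretrain_layer(encoder, decoder, batches, epochs=5, lr=1e-3):
    """Train one (encoder, decoder) pair to minimize reconstruction error."""
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr
    )
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x in batches:  # mini-batch SGD, no labels needed
            recon = decoder(torch.relu(encoder(x)))
            loss = loss_fn(recon, x)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Greedy layer-wise training: layer 1 reconstructs the raw input,
# layer 2 reconstructs the codes produced by the already-trained layer 1.
pretrain_layer(encoders[0], decoders[0], data)
codes = [torch.relu(encoders[0](x)).detach() for x in data]
pretrain_layer(encoders[1], decoders[1], codes)
```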

The idea is that the weights learned in an unsupervised manner to minimize reconstruction error (a representation-learning task) offer a good starting point for initializing a network for a supervised, discriminative task such as classification or similarity. I.e., the network learns something about the underlying distribution by looking at unlabeled data, which gives it a head start when it later has to discriminate between labeled classes. However, the weights still need to be "fine-tuned" for this new task. So add a logistic regression layer on top of the network and then do supervised learning with a labeled dataset. The fine-tuning step does gradient descent and adjusts the weights of all layers in the network simultaneously, as in the sketch below.
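
Continuing the sketch above, fine-tuning might look like the following: stack the pretrained encoder layers, put a fresh logistic-regression (softmax) layer on top, and run supervised gradient descent through the whole network. The class count, learning rate, and synthetic labeled data are again placeholder assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# The pretrained encoder layers would come from the pretraining step; fresh
# layers are used here only so this sketch runs on its own.
enc1, enc2 = nn.Linear(784, 256), nn.Linear(256, 64)

# Stack the encoders and add a new logistic-regression layer on top.
classifier = nn.Sequential(
    enc1, nn.ReLU(),
    enc2, nn.ReLU(),
    nn.Linear(64, 10),  # 10 classes assumed for illustration
)

opt = torch.optim.Adam(classifier.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()  # softmax + negative log-likelihood

# Stand-in for labeled mini-batches: (inputs, integer class labels).
labeled = [(torch.randn(32, 784), torch.randint(0, 10, (32,)))
           for _ in range(20)]

for _ in range(5):
    for x, y in labeled:  # backprop updates *all* layers simultaneously
        loss = loss_fn(classifier(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```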

The advantages of this way of training neural nets are:

  1. Unsupervised training lets you show the network more data, because it's much easier to get large unlabeled datasets than labeled ones.
  2. You can use the pre-trained network as a "jumping-off point" for training new classifiers, so you don't have to start from scratch each time.

For the paper, see Vincent et al. (2010), "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion."