Solved – Stacked shallow autoencoders vs. deep autoencoders

Tags: artificial intelligence, autoencoders, deep learning, machine learning

In Goodfellow et al.'s "Deep Learning", Chapter 14, page 506, I found the following statement:

"A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders, so we often encounter shallow autoencoders, even when the ultimate goal is to train a deep autoencoder."

I was just following the Keras tutorial on autoencoders, and they have a section on how to code up a deep autoencoder in Keras. I'm reproducing the code they give (using the MNIST dataset) below:

from keras.layers import Input, Dense
from keras.models import Model

# encoder: 784 -> 128 -> 64 -> 32
input_img = Input(shape=(784,))
encoded = Dense(128, activation='relu')(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)

# decoder: 32 -> 64 -> 128 -> 784
decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(784, activation='sigmoid')(decoded)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

# x_train / x_test are the MNIST images, flattened to 784-vectors
# and scaled to [0, 1], as in the tutorial
autoencoder.fit(x_train, x_train,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

My questions are:

  1. Does the code above represent stacked autoencoders or a deep autoencoder?
  2. If it is a deep autoencoder, how would you alter the above code to instead produce a stacked autoencoder? And vice versa?
  3. What are the advantages of one approach over the other?

Best Answer

The code builds a single deep autoencoder, trained end to end: three encoding layers followed by three decoding layers. "Stacking" means literally feeding the output of one block into the input of the next, so if you took this code, repeated it, and linked the outputs to the inputs, that would be a stacked autoencoder.
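
The book quote is describing greedy layer-wise pretraining: train one shallow autoencoder on the data, encode the data with it, train the next shallow autoencoder on those codes, and so on, then stack the trained layers into one deep model. Below is a minimal sketch of that procedure in the same Keras style, assuming x_train is the flattened MNIST data from the question's code; the variable names, epoch counts, and loss choices are illustrative placeholders, not the tutorial's.

from keras.layers import Input, Dense
from keras.models import Model

layer_sizes = [784, 128, 64, 32]
encoders, decoders = [], []
codes = x_train  # start from the raw (flattened, [0, 1]-scaled) images

for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
    inp = Input(shape=(n_in,))
    enc = Dense(n_out, activation='relu')(inp)
    # pixels live in [0, 1], so the first level reconstructs with a sigmoid;
    # the inner codes are unbounded ReLU outputs, so later levels use relu + mse
    out_act, loss = ('sigmoid', 'binary_crossentropy') if i == 0 else ('relu', 'mse')
    dec = Dense(n_in, activation=out_act)(enc)
    shallow = Model(inp, dec)
    shallow.compile(optimizer='adadelta', loss=loss)
    shallow.fit(codes, codes, epochs=20, batch_size=256, shuffle=True)
    encoders.append(shallow.layers[1])      # keep the trained encoder layer
    decoders.append(shallow.layers[2])      # keep the trained decoder layer
    codes = Model(inp, enc).predict(codes)  # codes for the next level

# stack the pretrained encoders, then the decoders in reverse, and fine-tune
x = deep_input = Input(shape=(784,))
for layer in encoders + decoders[::-1]:
    x = layer(x)
deep_autoencoder = Model(deep_input, x)
deep_autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
deep_autoencoder.fit(x_train, x_train, epochs=100, batch_size=256, shuffle=True)

The pretrained weights give the deep model a sensible starting point; the final fit then fine-tunes all the layers jointly, which is the "ultimate goal" the book quote refers to.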

Stacked denoising autoencoders, for instance, are used this way for unsupervised and semi-supervised learning, I believe.

In answer to your comment below, remember that any deep network is created by stacking layers. It's true that if there were no non-linearities in the layers you could collapse the entire network to a single layer, but there are non-linearities and you can't. "Stacking" isn't generally used to describe connecting simple layers, but that's what it is, and stacking autoencoders -- or other blocks of layers -- is just a way of making more complex networks.
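
To see why the non-linearities matter, note that two purely linear layers always collapse into one; here is a quick check of that claim (assumes only NumPy):

import numpy as np

# two linear layers collapse into one: W2 (W1 x) == (W2 W1) x for every x
rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 784))
W2 = rng.normal(size=(32, 64))
x = rng.normal(size=784)
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)  # a ReLU in between breaks this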

(For example, it's common in CNNs to have two convolutional layers followed by a pooling layer. These convolutional blocks are stacked.)
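
As a hedged illustration (not from the original answer), such conv-conv-pool blocks might be stacked in Keras like this; the input shape and filter counts are arbitrary choices:

from keras.layers import Input, Conv2D, MaxPooling2D
from keras.models import Model

x = inputs = Input(shape=(28, 28, 1))
for filters in (32, 64):  # two conv-conv-pool blocks, stacked
    x = Conv2D(filters, (3, 3), activation='relu', padding='same')(x)
    x = Conv2D(filters, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2))(x)
feature_extractor = Model(inputs, x)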
