Solved – How to train and fine-tune fully unsupervised deep neural networks

autoencoders, deep learning, machine learning, neural networks, unsupervised learning

In scenario 1, I had a multi-layer sparse autoencoder that tries to reproduce its input, with all layers trained together from randomly initialized weights. Without a supervised layer, this didn't learn any relevant information on my data (the code itself works fine; I've already verified it on many other deep neural network problems).
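For concreteness, here's a minimal sketch of that setup, assuming PyTorch and flat 784-dimensional inputs; the layer sizes, sparsity penalty, and hyperparameters are illustrative, not my actual settings:

```python
import torch
import torch.nn as nn

class DeepSparseAE(nn.Module):
    """Multi-layer autoencoder trained end-to-end from random init."""
    def __init__(self, sizes=(784, 256, 64)):
        super().__init__()
        # Encoder: 784 -> 256 -> 64; the decoder mirrors it back to 784.
        enc, dec = [], []
        for a, b in zip(sizes, sizes[1:]):
            enc += [nn.Linear(a, b), nn.Sigmoid()]
        for a, b in zip(sizes[::-1], sizes[::-1][1:]):
            dec += [nn.Linear(a, b), nn.Sigmoid()]
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = DeepSparseAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)  # stand-in for a real data batch
for _ in range(10):
    opt.zero_grad()
    recon, code = model(x)
    # Reconstruction loss plus an L1 penalty on the code for sparsity.
    loss = nn.functional.mse_loss(recon, x) + 1e-3 * code.abs().mean()
    loss.backward()
    opt.step()
```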

In scenario 2, I simply train multiple autoencoders in a greedy layer-wise fashion, similar to deep learning pretraining (but without a supervised step at the end), each layer on the hidden-layer activations of the previous autoencoder. Trained separately, they now learn some patterns (as I can see from the visualized weights), though nothing spectacular, which is about what I'd expect from single-layer AEs.
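A sketch of the greedy layer-wise phase, under the same assumptions as above (the function name and sizes are illustrative):

```python
import torch
import torch.nn as nn

def train_single_ae(data, hidden, epochs=10, lr=1e-3):
    """Train one AE on `data`; return its encoder and the resulting codes."""
    enc = nn.Sequential(nn.Linear(data.shape[1], hidden), nn.Sigmoid())
    dec = nn.Linear(hidden, data.shape[1])
    opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(dec(enc(data)), data).backward()
        opt.step()
    with torch.no_grad():
        codes = enc(data)
    return enc, codes

x = torch.rand(256, 784)      # stand-in dataset
layer_inputs, encoders = x, []
for hidden in (256, 64):      # illustrative layer sizes
    # Each new AE reconstructs the previous AE's hidden activations.
    enc, layer_inputs = train_single_ae(layer_inputs, hidden)
    encoders.append(enc)
```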

So I decided to test whether the pretrained layers, connected into one multi-layer AE, could perform better than the randomly initialized version. As you can see, this is the same idea as the fine-tuning step in deep neural networks.

But during fine-tuning, instead of improving, the neurons in all the layers quickly converge towards one identical pattern and end up learning nothing.

Question: What's the best configuration to train a fully unsupervised multi-layer reconstructive neural network? Layer-wise first and then some sort of fine-tuning? Why is my configuration not working?

Best Answer

After some tests I've come up with a method that seems to give very good results, and, as you'd expect from 'fine-tuning', it improves the performance of all the layers:

As in normal greedy layer-wise learning, each new autoencoder tries to reconstruct the activations of the previous autoencoder's hidden layer. However, the last autoencoder (which will be the last layer of our multi-layer autoencoder during fine-tuning) is different: it takes the activations of the previous layer as input but tries to reconstruct the 'global' input (i.e. the original input that was fed to the first layer).

This way, when I connect all the layers and train them together, the multi-layer autoencoder actually reconstructs the original image in its final output. I found a huge improvement in the learned features, even without a supervised step.
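Here's a self-contained sketch of the idea, assuming PyTorch; all layer sizes, epoch counts, and learning rates are illustrative:

```python
import torch
import torch.nn as nn

def fit(enc, dec, inputs, targets, epochs=10, lr=1e-3):
    """Train one encoder/decoder pair to map `inputs` onto `targets`."""
    opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(dec(enc(inputs)), targets).backward()
        opt.step()

x = torch.rand(256, 784)                    # stand-in dataset
sizes, codes, encoders = (256, 64, 32), x, []
for i, hidden in enumerate(sizes):
    enc = nn.Sequential(nn.Linear(codes.shape[1], hidden), nn.Sigmoid())
    # Every AE targets the previous layer's codes, except the last one,
    # which targets the original ('global') input x.
    target = x if i == len(sizes) - 1 else codes
    dec = nn.Linear(hidden, target.shape[1])
    fit(enc, dec, codes, target)
    encoders.append(enc)
    with torch.no_grad():
        codes = enc(codes)

# Stack the encoders plus the last decoder: the output now lives in input
# space, so the whole multi-layer AE can be fine-tuned against x directly.
stacked = nn.Sequential(*encoders, dec)
opt = torch.optim.Adam(stacked.parameters(), lr=1e-4)
for _ in range(20):
    opt.zero_grad()
    nn.functional.mse_loss(stacked(x), x).backward()
    opt.step()
```

The key design choice is that the stacked model has a well-defined reconstruction target during fine-tuning: because the last decoder already maps back to input space, the joint training signal is the original input rather than an intermediate code.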

I don't know whether this corresponds to some standard implementation, but I haven't seen this trick described anywhere before.
