Machine Learning – How to Perform Transfer Learning with Autoencoders

machine learning, neural networks, transfer learning

I have been thinking of training a variational autoencoder on a larger texture dataset, so that I can fine-tune it on my specific texture dataset, in the hope that the reconstruction will be better.

I have not really found anything on how to do this fine-tuning with autoencoders. Do I add a layer before and after the latent vector, or do I have to do something else?

Best Answer

I am not sure I am understanding your question correctly, but from what I understand: you can train all layers on the large texture dataset, then freeze the weights of all layers up to and including the bottleneck layer (the one from which you extract the new features) and train the remaining layers on the new dataset.
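As a minimal sketch of that first approach (assuming a plain dense autoencoder rather than a variational one; `x_large` and `x_small` are placeholders for the large and the specific texture datasets):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical dense autoencoder for illustration; a real texture model
# would likely be convolutional, but the freezing logic is the same.
input_dim, latent_dim = 784, 32

autoencoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu", name="enc_1"),
    layers.Dense(latent_dim, activation="relu", name="bottleneck"),
    layers.Dense(128, activation="relu", name="dec_1"),
    layers.Dense(input_dim, activation="sigmoid", name="dec_out"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# 1) Pre-train all layers on the large texture dataset
#    (x_large is a placeholder for your own data):
# autoencoder.fit(x_large, x_large, epochs=50, batch_size=256)

# 2) Freeze everything up to and including the bottleneck, then
#    fine-tune the remaining layers on the specific dataset:
for layer in autoencoder.layers:
    layer.trainable = False
    if layer.name == "bottleneck":
        break
autoencoder.compile(optimizer="adam", loss="mse")  # re-compile after changing trainability
# autoencoder.fit(x_small, x_small, epochs=20, batch_size=64)
```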

If your aim is just reconstruction, not dimensionality reduction: you can do the same thing, but instead freeze all layers except the last one, so you end up optimizing the weights of only the last layer.
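Reusing the hypothetical model from the sketch above, that variant only changes which layers get frozen:

```python
# Freeze every layer except the last one, so fine-tuning only
# updates the weights of the final reconstruction layer:
for layer in autoencoder.layers[:-1]:
    layer.trainable = False
autoencoder.layers[-1].trainable = True
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_small, x_small, epochs=20, batch_size=64)
```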

Either way, the code will look like the sketches above; only the specific network architecture will change. The Keras transfer-learning guide walks through the same pattern:

https://keras.io/guides/transfer_learning/

Hope this helps, good luck

Edit:

Then you should follow the first approach. Reconstruction error might be misleading, because autoencoders also risk overfitting in the sense that the extracted features may be useless when you feed them to a new model. The best way to find the right encoding is trial and error: use your extracted features for your specific goal and see which approach yields better performance.

You can think of transfer learning as acquiring fundamental knowledge about a particular topic and then specializing in a sub-topic. Consider the example of general news and financial news: both are news, but the context is entirely different. I can train my autoencoder on general news so that it sees lots of varied examples and constructs meaningful word vectors. Then I freeze the layers up to and including the bottleneck, so my model preserves the information acquired from the sizeable textual data. After that, I can train the remaining layers on my financial-news dataset, forcing the model to learn finance-specific jargon.

If instead I just trained my model on financial news, the reconstruction error would probably be lower. Still, the extracted features might not be useful, because my financial-news dataset might not have enough variance. That is, my autoencoder would just be overfitting the data at hand.
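For that trial-and-error step, one option (again just a sketch reusing the hypothetical model above; `x_small`, `y_small`, and the binary downstream task are assumptions for illustration) is to extract the bottleneck features and feed them to a small downstream model, then compare performance across encodings:

```python
# Build a feature extractor that outputs the bottleneck activations.
feature_extractor = keras.Model(
    inputs=autoencoder.input,
    outputs=autoencoder.get_layer("bottleneck").output,
)
# features = feature_extractor.predict(x_small)

# A small downstream classifier on top of the extracted features;
# whichever encoding gives better validation performance wins.
classifier = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    layers.Dense(1, activation="sigmoid"),
])
classifier.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(features, y_small, validation_split=0.2)
```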
