Solved – How to Fine Tune a pre-trained network

deep learning, machine learning, transfer learning

I'm looking into using transfer learning to take the ResNet50 model trained on ImageNet and fine-tune it on my own dataset using Keras.

However, I feel I have some misconceptions about what exactly fine-tuning is and how to perform it.

From this paper, which I read many months back, I understood transfer learning to be the process of taking the first n layers from a pre-trained model and adding your own final layers for your task, and fine-tuning to be the case where you do NOT freeze the weights of the transferred layers, but instead allow them to update with a very low learning rate. I also understood that this approach gives better generalisation, and better results in general, than freezing the weights of the transferred layers.
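
To make that concrete, here is a minimal Keras sketch of what I understand fine-tuning to be; the input shape, number of classes, and learning rate are just placeholders for my setup:

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10            # placeholder for my dataset
input_shape = (224, 224, 3)

# Transfer the pre-trained convolutional layers (no ImageNet head).
base_model = keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=input_shape
)
base_model.trainable = True  # do NOT freeze the transferred weights

# Add my own final layers for the new task.
inputs = keras.Input(shape=input_shape)
x = base_model(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Fine-tune everything with a very low learning rate.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```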

However, every time I see fine-tuning mentioned on the internet, people refer to freezing the weights of the transferred layers and applying a low learning rate to the new layers, only allowing the new layers' weights to update. As seen here.
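
As far as I can tell, that frozen-weights version would look something like this (again, the class count is a placeholder):

```python
from tensorflow import keras
from tensorflow.keras import layers

base_model = keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg"
)
base_model.trainable = False  # freeze the transferred layers

model = keras.Sequential([
    base_model,
    layers.Dense(10, activation="softmax"),  # 10 classes is a placeholder
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```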

This answer also recommends freezing the weights from the transferred layers.

I just don't see how this advice lines up with the results from the paper. The paper suggests that taking a large number of layers from the original network and freezing their weights gives a poor result, whereas allowing the weights of the transferred layers to be fine-tuned improves generalisation.

Best Answer

I would like to share my understanding here. Here is a thesis whose related-work section explains transfer learning and fine-tuning. The survey on transfer learning is also a good read to understand these concepts in detail.

  1. Unsupervised pre-training is a good strategy to train deep neural networks for supervised and unsupervised tasks.
  2. Fine-tuning can be seen as an extension of the above approach, where the transferred layers are allowed to retrain, or fine-tune, on the domain-specific task.
  3. Transfer learning, on the other hand, requires two different tasks, where learning from one distribution can be transferred to another. [These points are taken from the related work of this thesis]

Now, I think your understanding of transfer learning and fine-tuning is correct. Freezing the weights is a choice you get to make: if you don't freeze them, we say the network is fine-tuned on the domain-specific data, and yes, that should usually provide better generalization. Whether you should freeze the weights, on the other hand, depends on the problem and the type of network you have. For example, ImageNet-trained layers are widely used to classify images with the transferred layers frozen, because (1) retraining them is computationally expensive, (2) the ImageNet data covers a large distribution of image data, and (3) the last layer is usually enough to capture the small variations that a domain-specific image set introduces. This works because of the strong representational capacity of ImageNet features, and it may not be true for every model. Hence, depending on the case, one should answer this question empirically.
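
For illustration, here is a minimal Keras sketch of the common two-phase recipe, assuming a ResNet50 base and a placeholder 10-class head: first train the new head with the transferred layers frozen, then optionally unfreeze everything and fine-tune at a much lower learning rate.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10  # placeholder for the domain-specific dataset

base = keras.applications.ResNet50(weights="imagenet", include_top=False, pooling="avg")

inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)  # keep BatchNorm statistics fixed
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Phase 1: transferred layers frozen, only the new head is trained.
base.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=5)

# Phase 2 (optional): unfreeze and fine-tune the whole network at a much
# lower learning rate. Recompile after changing `trainable`.
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, epochs=5)
```

Whether phase 2 helps is exactly the empirical question above: with a small or very ImageNet-like dataset, the frozen version is often enough; with more data or a different domain, unfreezing tends to generalize better.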