Solved – Transfer learning: How and why retrain only final layers of a network

backpropagation, machine learning, neural networks, transfer learning

In this video, Prof. Andrew Ng says regarding transfer learning:

Depending on how much data you have, you might just retrain the new layers of the network, or maybe you could retrain even more layers of this neural network.

The new layers he is referring to are ones that are added to replace the original output layer.

At this point Prof. Ng says:

If you have enough data, you could also retrain all the layers of the rest of the network.

  1. In transfer learning, is there any difference in how backprop is applied when only training the last few layers?

  2. Why would one want to avoid retraining all the layers of a transfer learning network if the fine-tuning dataset is small? I.e. (if I understand it correctly), why would one not want to apply normal back-propagation all the way through to the input layer?

Best Answer

Why would one want to avoid retraining all the layers of a transfer learning network if the fine-tuning dataset is small? I.e. (if I understand it correctly), why would one not want to apply normal back-propagation all the way through to the input layer?

If the new dataset is small, the reason to restrict training to the new layers is to avoid overfitting. The entire network contains many more parameters and, in the small data regime, there's a higher chance of finding a solution that fits the training set but doesn't generalize well.

The idea behind transfer learning is that the original network has learned an internal representation that will also work well for the new task. This representation is given by the output of the final layer we keep from the original network. By training only the new layers, we simply keep that representation and learn how to process it for the new task. Because the new layers contain fewer parameters than the entire network, there's less risk of overfitting.
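As a concrete illustration of keeping the learned representation fixed, here is a minimal PyTorch sketch. The choice of backbone (ResNet-18), the 10-class output, and the batch shapes are placeholders for illustration, not details from the question:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a network pretrained on a large dataset (ImageNet here).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze every parameter of the original network so training
# cannot change the learned representation.
for param in model.parameters():
    param.requires_grad = False

# Replace the original output layer with a new one for the new task
# (hypothetical 10-class problem). New layers are trainable by default.
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# The optimizer only receives the new layer's parameters, so only the
# mapping from the kept representation to the new outputs is learned.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch (illustrative shapes only).
inputs = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()   # gradients are only computed for the new layer
optimizer.step()  # only model.fc is updated
```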

In transfer learning, is there any difference in how backprop is applied when only training the last few layers?

There's no difference in how gradients are computed or how they're used to update the parameters. The only difference is that the parameters of the early layers are held fixed, so those components of the gradient need not be computed.
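For instance, in a framework like PyTorch, marking the early layers' parameters with `requires_grad=False` leaves the backprop algorithm itself untouched; autograd simply never materializes those gradient components. The toy layer names and sizes below are purely illustrative:

```python
import torch
import torch.nn as nn

# A toy two-layer network: "early" stands in for the kept (frozen) layers,
# "new" for the replacement output layer trained on the new task.
early = nn.Linear(4, 3)
new = nn.Linear(3, 2)

for p in early.parameters():
    p.requires_grad = False  # hold the early-layer parameters fixed

x = torch.randn(5, 4)
target = torch.randint(0, 2, (5,))

loss = nn.functional.cross_entropy(new(torch.relu(early(x))), target)
loss.backward()  # ordinary backprop; nothing about the algorithm changes

print(early.weight.grad)  # None: this gradient component was never computed
print(new.weight.grad)    # populated as usual

# The update rule is also unchanged, just applied only to trainable parameters.
torch.optim.SGD(new.parameters(), lr=0.1).step()
```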