Fine Tuning vs Transfer Learning vs Learning from Scratch in Deep Learning

computer vision, deep learning, object detection, transfer learning

In my master's thesis, I am researching transfer learning for a specific use case: a traffic sign detector implemented as a Single Shot Detector (SSD) with a VGG16 base network for classification. The research focuses on the problem of having a detector that is well trained and performs inference well on the traffic sign dataset it was trained on (I used the Belgium traffic sign detection dataset), but when the detector is used in another country (Germany, Austria, Italy, Spain, …), the traffic signs look more or less different, which results in an unwanted loss in performance. For an overview of this topic, I would recommend the Wikipedia article.

~~~ the following section is about my research questions ~~~

So, having a couple of examples of traffic signs from the new country, is it better to

  • fine-tune the network,
  • transfer-learn the network and freeze some of the convolution layers, or
  • (as a comparison) learn the new country from scratch (see the sketch below)?

Even for the very first detector (the one trained from scratch on the comprehensive Belgium dataset): is there any advantage to loading pretrained weights from published model zoos (for example VGG16/COCO) and then fine-tuning/transfer-learning based on that?
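Roughly, the difference between these variants in Keras terms looks like the sketch below. It uses a plain VGG16 classifier as a stand-in for my SSD detector, and the class count, learning rates and everything else are only illustrative, not taken from my actual code:

```python
from tensorflow import keras
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 62  # illustrative number of traffic-sign classes

def build_model(pretrained):
    """Plain VGG16 classifier standing in for the SSD/VGG16 detector."""
    base = VGG16(weights="imagenet" if pretrained else None,
                 include_top=False, input_shape=(224, 224, 3))
    x = keras.layers.GlobalAveragePooling2D()(base.output)
    out = keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(base.input, out), base

# 1) Learning from scratch: random weights, everything trainable
scratch, _ = build_model(pretrained=False)
scratch.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                loss="categorical_crossentropy")

# 2) Transfer learning: model-zoo weights, convolution base frozen,
#    only the new head is trained
transfer, base = build_model(pretrained=True)
for layer in base.layers:
    layer.trainable = False
transfer.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                 loss="categorical_crossentropy")

# 3) Fine-tuning: model-zoo weights, everything trainable,
#    but with a much smaller learning rate
finetune, _ = build_model(pretrained=True)
finetune.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
                 loss="categorical_crossentropy")
```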

Now what am I asking here:
I've implemented my detector not on my own, but based it on an existing SSD port to Keras/TensorFlow (from here), and I have already trained it in different variations (Belgium from scratch, pretrained with MS COCO, transfer to Germany, convolutions frozen, fine-tuned to Germany). After weeks of training I can now say that Belgium from scratch with random weights converges fastest (after only 40 epochs/2 days my custom SSD loss function is down to a value of 3), while all other variations need much more time and more epochs, and their loss never falls below a value of 9.

I also found pretrained weights for traffic sign classification with VGG16, which I thought should be the ideal base for transfer learning on this topic, but this detector was the worst-performing so far (the loss stagnated at 11, even when the learning rate was changed, and after 100 epochs it overfitted).

It seems that transfer learning or fine-tuning on these detectors doesn't have any advantage at all. It's likely that I am doing something wrong or that I misunderstand the purpose of transfer learning (I thought it should speed up training, as most layers aren't trainable and therefore fewer calculations are needed).

I don't know if this is the proper platform for a discussion of this topic; perhaps you know a Slack or Gitter channel it belongs to. I just don't know whether I am stuck or doing something horribly wrong.

Best Answer

Transfer learning is when a model developed for one task is reused to work on a second task. Fine-tuning is one approach to transfer learning, where you change the model's output to fit the new task and train only that output part of the model.

In transfer learning (or domain adaptation), we first train the model with one dataset. Then we train the same model with another dataset that has a different distribution of classes, or even with classes different from those in the first training dataset.

In fine-tuning, an approach to transfer learning, we have a dataset and use, say, 90% of it for training. Then we train the same model on the remaining 10%. Usually, we lower the learning rate so that the training does not have a significant impact on the already adjusted weights. You can also take a base model that works for a similar task and freeze some of its layers, to keep the old knowledge when performing the new training session with the new data. The output layer can also be different, with parts of it frozen for the new training.
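As a rough sketch of that recipe in Keras (assuming a VGG16 base as in the question; the layer names, class count and learning rate are only illustrative), freezing the early blocks and lowering the learning rate looks like this:

```python
from tensorflow import keras
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze the early convolution blocks (generic filters) to keep the old
# knowledge; leave only the last block trainable
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

# New, task-specific output head for the new dataset
x = keras.layers.GlobalAveragePooling2D()(base.output)
outputs = keras.layers.Dense(43, activation="softmax")(x)  # 43 is illustrative
model = keras.Model(base.input, outputs)

# A smaller learning rate than for from-scratch training, so the pretrained
# weights are only nudged, not overwritten
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```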

In my experience, learning from scratch leads to better results, but it is much more costly than the other approaches, especially in terms of time and resource consumption.

With transfer learning, you should freeze some layers, mainly the pretrained ones, train only the newly added ones, and decrease the learning rate so that the weights are adjusted without destroying what they already encode for the network. If you increase the learning rate instead, you will normally get poor results, due to the large steps in the gradient descent optimisation. This can lead to a state where the network cannot find the global minimum, only a local one.
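One way to put the learning-rate point into practice (a sketch, not tied to the question's code) is to start from a small learning rate and reduce it further when the validation loss stops improving, rather than increasing it:

```python
from tensorflow import keras

# A small initial learning rate keeps the gradient steps small, so the
# pretrained weights are adjusted gently
optimizer = keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)

# Shrink the learning rate further whenever the validation loss plateaus
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                              factor=0.1, patience=5)

# Reusing `model` from the previous sketch:
# model.compile(optimizer=optimizer, loss="categorical_crossentropy")
# model.fit(train_data, validation_data=val_data,
#           epochs=100, callbacks=[reduce_lr])
```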

Using a pretrained model from a similar task usually gives great results when combined with fine-tuning. However, if you do not have enough data in the new dataset, or if your hyperparameters are not well chosen, you can get unsatisfactory results. Machine learning always depends on its dataset and on the network's parameters, and in that case you should only use the "standard" transfer learning.

So, we need to evaluate the trade-off between resource and time consumption and the accuracy we want in order to choose the best approach.
