Solved – Do neural networks usually take a while to “kick in” during training

classification, conv-neural-network, machine learning, neural networks, tensorflow

I am trying to train a deep neural network for classification, using back-propagation. Specifically, I am using a convolutional neural network for image classification, built with the TensorFlow library. During training, I am seeing some strange behaviour, and I'm wondering whether this is typical or whether I may be doing something wrong.

So, my convolutional neural network has 8 layers (5 convolutional, 3 fully-connected). All weights and biases are initialised to small random numbers. I then set a step size and train with mini-batches, using TensorFlow's Adam optimizer.
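For concreteness, below is a minimal sketch of this kind of setup in tf.keras. The input shape, layer widths, kernel sizes, number of classes, learning rate, and batch size are all illustrative assumptions, since the question does not specify them.

```python
# Minimal sketch of the described setup (5 conv + 3 fully-connected layers,
# Adam, mini-batch training). All sizes and hyperparameters are assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),           # assumed input size
    # 5 convolutional layers
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    # 3 fully-connected layers
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),                   # assumed 10 classes, raw logits
])
# Keras layers default to Glorot-uniform initialisation; to mirror the
# question's "small random numbers" you could instead pass, e.g.,
# kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01)
# to each layer.

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # step size is an assumption
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, batch_size=64, epochs=50)  # mini-batch training
```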

The strange behaviour I am talking about is that for roughly the first 10 loops through my training data, the training loss does not, in general, decrease. The weights are being updated, but the training loss stays at roughly the same value, sometimes going up and sometimes going down between mini-batches. It stays this way for a while, and I always get the impression that the loss is never going to decrease.

Then, all of a sudden, the training loss decreases dramatically. For example, within about 10 loops through the training data, the training accuracy goes from about 20% to about 80%. From then onwards, everything converges nicely. The same thing happens each time I run the training pipeline from scratch; below is a graph illustrating one example run.

[Figure: training curve from one example run, flat at first and then rising sharply]

So, what I am wondering is whether this is normal behaviour when training deep neural networks, whereby it takes a while to "kick in". Or is it likely that I am doing something wrong that is causing this delay?

Thanks very much!

Best Answer

The fact that the algorithm took a while to "kick in" is not particularly surprising.

In general, the objective functions optimized when training neural networks are highly multi-modal. As such, unless you have some sort of clever set of initial values for your problem, there is no reason to believe that you will start out on a steep descent. Your optimization algorithm will therefore wander almost randomly until it finds a fairly steep valley to begin descending into. Once such a valley has been found, you should expect most gradient-based algorithms to quickly begin narrowing in on the particular mode it is closest to.
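As a side note on the point about initial values: one common way to make the starting point less arbitrary, when using ReLU activations, is a variance-scaled scheme such as He initialisation, which in practice often shortens the flat early phase. The answer above does not prescribe this; the snippet below is only a sketch of how you might swap it in with tf.keras and compare.

```python
# Sketch: swapping naive "small random numbers" initialisation for a
# variance-scaled scheme (He initialisation) in tf.keras. This is one example
# of the "clever set of initial values" mentioned above, not a prescription.
import tensorflow as tf

naive_init = tf.keras.initializers.RandomNormal(stddev=0.01)  # tiny random weights
he_init = tf.keras.initializers.HeNormal()                    # scaled for ReLU layers

def conv_layer(init):
    # Same layer, differing only in how its kernel is initialised.
    return tf.keras.layers.Conv2D(
        64, 3, padding="same", activation="relu",
        kernel_initializer=init,
    )

# Build the same architecture twice, once with each initialiser, train both
# for a few epochs, and compare how quickly the training loss starts to fall.
```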