I am training a neural network using i) SGD and ii) Adam Optimizer. When using normal SGD, I get a smooth training loss vs. iteration curve as seen below (the red one). However, when I used the Adam Optimizer, the training loss curve has some spikes. What's the explanation of these spikes?
Model Details:
14 input nodes -> 2 hidden layers (100 -> 40 units) -> 4 output units
I am using the default parameters for Adam (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8) and a batch_size = 32.
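For reference, a single Adam update with these default hyperparameters can be sketched in NumPy as follows. This is a minimal illustration of the update rule, not the questioner's actual training code; the function name and calling convention are my own.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for a single parameter array (t is 1-based)."""
    m = beta_1 * m + (1 - beta_1) * grad        # EMA of the gradient
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # EMA of the squared gradient
    m_hat = m / (1 - beta_1 ** t)               # bias correction
    v_hat = v / (1 - beta_2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v
```

Because the gradient here comes from whatever mini-batch was sampled, a noisy batch feeds directly into both moment estimates and the update.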
Best Answer
The spikes are an unavoidable consequence of mini-batch gradient descent in Adam (batch_size=32). Some mini-batches contain, by chance, "unlucky" data for the optimization, inducing the spikes you see in your cost function when using Adam. If you try stochastic gradient descent (the same as using batch_size=1), you will see even more spikes in the cost function. The same does not happen with (full) batch GD, because it uses all of the training data (i.e. the batch size equals the cardinality of your training set) in each optimization epoch. Since the cost in your first graphic decreases monotonically and smoothly, it seems the title ("(i) With SGD") is wrong and you are actually using (full) batch gradient descent rather than SGD.

In his great Deep Learning course on Coursera, Andrew Ng explains this in great detail using the image below: