When training a neural network, what difference does it make to set:
- batch size to $a$ and number of iterations to $b$
- vs. batch size to $c$ and number of iterations to $d$
where $ab = cd$?
To put it otherwise: assuming we train the neural network on the same total number of training examples, how should one choose the batch size and the number of iterations? Here, batch size × number of iterations = number of training examples shown to the network, with the same training example potentially shown several times.
I am aware that the larger the batch size, the more memory is needed, and that larger batches often make computations faster. But in terms of the performance of the trained network, what difference does it make?
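To make the constraint concrete, here is a minimal toy sketch (hypothetical names and problem, not from the question) of two SGD runs on a 1-D linear model that are shown exactly the same number of training examples, $ab = cd = 1000$, but split the budget differently between batch size and number of updates:

```python
import numpy as np

def sgd_run(batch_size, num_iterations, lr=0.05, seed=0):
    """Train a 1-D linear model y = w*x with mini-batch SGD on a
    hypothetical toy problem: x ~ N(0,1), y = 3*x + noise.

    Returns the learned weight and the number of training examples
    shown, i.e. batch_size * num_iterations.
    """
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(num_iterations):
        x = rng.standard_normal(batch_size)
        y = 3.0 * x + 0.1 * rng.standard_normal(batch_size)
        grad = np.mean(2.0 * (w * x - y) * x)  # d/dw of mean squared error
        w -= lr * grad
    return w, batch_size * num_iterations

# Same data budget ab = cd = 1000 examples, split differently:
w_small, shown_small = sgd_run(batch_size=4,   num_iterations=250)
w_large, shown_large = sgd_run(batch_size=100, num_iterations=10)

# Both runs see 1000 examples, but the small-batch run performs
# 250 (noisier) parameter updates versus only 10 (cleaner) ones.
print(shown_small == shown_large)
print(w_small, w_large)
```

At a fixed learning rate, the small-batch run takes many more optimization steps for the same data budget, which is one reason the two settings behave differently in practice.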
Best Answer
From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :
Also, some good insights from Ian Goodfellow answering the question "Why not use the whole training set to compute the gradient?" on Quora:
Related: Batch gradient descent versus stochastic gradient descent