When training an artificial neural network using stochastic gradient descent with mini-batches, if the data set size is not a multiple of the mini-batch size, should the last mini-batch contain fewer samples? Or is it instead preferable to have the last mini-batch contain the same number of samples as the other batches, by randomly adding samples from other batches (which is the strategy used here and here)?
Solved – When the data set size is not a multiple of the mini-batch size, should the last mini-batch be smaller, or contain samples from other batches
deep learning, gradient descent, neural networks
Best Answer
Use the same number; otherwise you're putting more weight on the samples in the final mini-batch (unless you scale down the learning rate to match the smaller size).
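A minimal sketch of that scaling idea, with illustrative names not taken from the original post: shrink the learning rate for a runt batch in proportion to its size, so each sample contributes the same weight to the epoch as samples in full batches.

```python
def batch_indices(n_samples, batch_size):
    """Yield index ranges over the data; the last batch may be smaller."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))

def scaled_lr(base_lr, batch, batch_size):
    """Down-weight the step for a smaller batch: lr * len(batch) / batch_size."""
    return base_lr * len(batch) / batch_size

# Example: 10 samples, batch size 4 -> batches of 4, 4, and 2 samples.
sizes = [len(b) for b in batch_indices(10, 4)]
lrs = [scaled_lr(0.1, b, 4) for b in batch_indices(10, 4)]
```

With a mean-reduced loss, this scaling makes the runt batch's update equivalent to a sum-reduced update normalized by the full batch size.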
Adding random samples from the training set should be fine too (as long as your sampling pool includes the runt mini-batch), since each sample then has an equal chance of being seen twice in an epoch.
Or just do a modulo and grab samples from the beginning again.
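The modulo trick can be sketched as follows (a simple illustration, not the exact code from either linked strategy): pad the final mini-batch by wrapping around to the start of the shuffled index list.

```python
def padded_batches(indices, batch_size):
    """Split indices into equal-size batches, wrapping around via modulo
    so the final batch is padded with samples from the beginning."""
    n = len(indices)
    n_batches = -(-n // batch_size)  # ceiling division
    return [[indices[(b * batch_size + i) % n] for i in range(batch_size)]
            for b in range(n_batches)]

# Example: 10 samples, batch size 4 -> last batch wraps: [8, 9, 0, 1].
batches = padded_batches(list(range(10)), 4)
```

If the indices are reshuffled each epoch, the wrapped samples vary from epoch to epoch, so no fixed subset is systematically over-weighted.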
In practice, it probably doesn't matter much.