I understand the idea behind momentum, and how to implement it with batch gradient descent, but I'm not sure how to implement it with mini-batch gradient descent. As I understand it, implementing momentum in batch gradient descent goes like this:
for example in training_set:
    calculate gradient for this example
    accumulate the gradient

for w, g in zip(weights, gradients):
    w = w - learning_rate * g + momentum * gradients_at[-1]
where gradients_at records the gradient for each weight at backprop iteration t.
Is this correct? If so, what modifications are necessary to apply this technique in mini-batch gradient descent?
Best Answer
The only difference between batch and mini-batch gradient descent is that each update uses a subset of the training set rather than the entire dataset. So you calculate the gradient over just that subset of samples and apply exactly the same momentum update you described for the batch case. Repeat for many epochs, where each epoch cycles through different subsets that together cover the full dataset.
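A minimal sketch of this in NumPy, assuming a toy linear-regression problem (the data, hyperparameters, and variable names here are all illustrative, not from the question). The momentum update is the classical "velocity" form, applied once per mini-batch instead of once per full pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = X @ w_true + small noise
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n_samples)

w = np.zeros(n_features)         # weights
velocity = np.zeros(n_features)  # running momentum term (last update direction)
learning_rate, momentum = 0.1, 0.9
batch_size, n_epochs = 32, 50

for epoch in range(n_epochs):
    # Shuffle once per epoch so each mini-batch is a different subset
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error over this mini-batch only
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
        # Classical momentum: same rule as in full-batch GD,
        # just applied once per mini-batch
        velocity = momentum * velocity - learning_rate * grad
        w += velocity

print(w)  # converges close to w_true
```

Note that the velocity carries over across mini-batches (and epochs); only the gradient computation is restricted to the current subset.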