Solved – the difference among stochastic, batch and mini-batch learning styles

backpropagation, deep learning, gradient descent, machine learning, neural networks

As far as I know, the three styles work as follows:

stochastic: The error is calculated for each sample s, so we can calculate the gradient for s and update the weights of the network according to that gradient. An epoch is one complete pass over the whole training set, so if we have N samples, each epoch involves N weight updates.
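For concreteness, here is a minimal sketch of what I mean by stochastic updates, using a linear model with squared error; the data, model, and learning rate are purely illustrative:

```python
# Stochastic (per-sample) updates: one weight update per sample, N per epoch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # N = 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # targets from a known weight vector
w = np.zeros(3)                        # weights to learn
lr = 0.01

for epoch in range(5):
    for x_s, y_s in zip(X, y):         # one sample s at a time
        error = x_s @ w - y_s          # error for this sample
        grad = error * x_s             # gradient of 0.5 * error**2 w.r.t. w
        w -= lr * grad                 # update immediately
```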

batch: The algorithm calculates the error of each sample s and accumulates the gradients of all samples. At the end of the epoch, we take the average of this accumulated gradient and update the weights of the network, so there is only one weight update per epoch.
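The corresponding sketch for full-batch updates, with the same illustrative setup as above:

```python
# Full-batch updates: accumulate gradients over all N samples,
# average them, and apply a single update per epoch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
lr = 0.1

for epoch in range(5):
    grad_sum = np.zeros_like(w)
    for x_s, y_s in zip(X, y):
        error = x_s @ w - y_s
        grad_sum += error * x_s        # accumulate per-sample gradients
    w -= lr * grad_sum / len(X)        # average, then one update per epoch
```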

mini-batch: The algorithm divides the training set into subsets called mini-batches. For each mini-batch m, the algorithm calculates the error of each sample s and accumulates the gradients of the samples in m. Once every sample in m has been processed, we take the average of the accumulated gradient for m and update the weights of the network. So, if we divide the training set into X mini-batches, each epoch involves X weight updates, one per mini-batch.
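And the mini-batch version of the same illustrative sketch, with an arbitrary batch size of 10:

```python
# Mini-batch updates: split the training set into N / batch_size mini-batches,
# each producing one averaged-gradient update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
lr = 0.05
batch_size = 10

for epoch in range(5):
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        errors = xb @ w - yb                # errors for the mini-batch
        grad = xb.T @ errors / len(xb)      # average gradient over the batch
        w -= lr * grad                      # one update per mini-batch
```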

Is this correct?

In the case of batch or mini-batch back-propagation, do we really use the "average gradient"? Or do we instead use only the sum of the gradients that were calculated?

If this is not correct, what is the actual sequence of operations (error calculation, gradient calculation, update of weights) for each style of back-propagation?

Best Answer

Yes, your understanding is correct.

In the case of batch or mini-batch back-propagation, do we really use the "average ...

We should use the average gradient.

However, you can choose the learning rate to account for the averaging: if you use the sum, the division by the batch size can be subsumed into the learning rate. The learning rate then becomes dependent on the batch size, which is another practical reason to use the average.
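As a quick numeric illustration of that equivalence (the gradient values here are arbitrary), summing the gradients with learning rate lr / batch_size takes exactly the same step as averaging them with learning rate lr:

```python
# Check: sum of gradients with lr / B == average of gradients with lr.
import numpy as np

rng = np.random.default_rng(1)
grads = rng.normal(size=(32, 3))    # per-sample gradients for one mini-batch
lr = 0.1

step_avg = lr * grads.mean(axis=0)                 # average gradient, rate lr
step_sum = (lr / len(grads)) * grads.sum(axis=0)   # summed gradient, rate lr / B
assert np.allclose(step_avg, step_sum)
```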