Solved – How to update weights in a neural network using gradient descent with mini-batches

backpropagation, gradient descent, machine learning, neural networks

How does gradient descent work for training a neural network if I use mini-batches (i.e., sample a subset of the training set at each step)? I have thought of three different possibilities:

  1. Epoch starts. We sample and feed forward only one minibatch, get the error, and backprop it, i.e., update the weights. Epoch over.

  2. Epoch starts. We sample and feed forward a minibatch, get the error, and backprop it, i.e., update the weights. We repeat this until we have sampled the full data set. Epoch over.

  3. Epoch starts. We sample and feed forward a minibatch, get the error, and store it. We repeat this until we have sampled the full data set. We somehow average the errors and backprop them by updating the weights. Epoch over.

Best Answer

Mini-batch gradient descent is implemented basically as you describe in option 2.

  2. Epoch starts. We sample and feed forward a minibatch, get the error, and backprop it, i.e., update the weights. We repeat this until we have sampled the full data set. Epoch over.

Assuming that the network is minimizing the following objective function: $$ \frac{\lambda}{2}||\theta||^2 + \frac{1}{n}\sum_{i=1}^n E(x^{(i)}, y^{(i)}, \theta) $$

The corresponding weight update step is essentially

$$ \theta = (1 - \alpha \lambda) \theta - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1} \frac{\partial E}{\partial \theta}(x^{(k)}, y^{(k)}, \theta) $$

where the symbols mean the following:

$E$ = the error measure (also sometimes denoted as the cost $J$)

$\theta$ = the weights

$\alpha$ = the learning rate

$\lambda$ = the regularization (weight decay) strength, so $1 - \alpha \lambda$ is the factor by which the weights shrink at each step

$b$ = the batch size

$n$ = the number of training examples

$x^{(i)}, y^{(i)}$ = the $i$-th training input and its target

You loop over consecutive batches (i.e. increment $i$ by $b$) and update the weights after each one. This more frequent weight updating, combined with vectorization over each batch, is what lets mini-batch gradient descent typically converge faster than either full-batch or purely stochastic (single-example) gradient descent.
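
For concreteness, here is a minimal NumPy sketch of one such epoch. The names (`run_epoch`, `grad_E`, etc.) are illustrative only: `grad_E` stands in for whatever backpropagation routine computes the batch-averaged gradient $\frac{1}{b}\sum_k \frac{\partial E}{\partial \theta}$, and the per-epoch shuffle is common practice rather than part of the update rule itself.

```python
import numpy as np

def run_epoch(theta, X, Y, grad_E, alpha=0.01, lam=1e-4, b=32):
    """One epoch of mini-batch gradient descent with weight decay.

    theta  : current weight vector
    X, Y   : training inputs and targets, one example per row
    grad_E : callable (theta, X_batch, Y_batch) -> batch-averaged dE/dtheta
             (placeholder for the backprop gradient computation)
    alpha  : learning rate
    lam    : regularization / weight-decay strength
    b      : batch size
    """
    n = X.shape[0]
    order = np.random.permutation(n)        # shuffle once per epoch (common practice)
    for i in range(0, n, b):                # loop over consecutive batches, incrementing by b
        idx = order[i:i + b]                # indices of this mini-batch (last one may be smaller)
        g = grad_E(theta, X[idx], Y[idx])   # (1/b) * sum_k dE/dtheta over the batch
        theta = (1 - alpha * lam) * theta - alpha * g   # the update step from the formula above
    return theta
```

As a toy usage, for a linear model with squared error one could pass `grad_E = lambda th, Xb, Yb: Xb.T @ (Xb @ th - Yb) / len(Yb)` and call `run_epoch` once per epoch until the objective stops improving.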