Optimization – Understanding Mini-Batch Gradient Descent in Machine Learning

gradient descent, machine learning, optimization

I would like to understand the steps of mini-batch gradient descent for training a neural network.

My training data $(X,y)$ has dimensions $(k \times n)$ and $(1 \times n)$, where $k$ is the number of features and $n$ is the number of observations.

For each layer $l=1,\dots,L$ my parameters are $W^{[l]}$ of dimension $(n^{[l]} \times n^{[l-1]})$, where $n^{[0]}=k$.

a) First I randomly initialize the parameters $W^{[l]}$ for $l=1,…L$.

b) I take a sample of length $p\in[1,n]$ of my training data, denoted by $(X^{(1)},y^{(1)})$ for sample number $1$.

c) I compute the cost $J^{(1)}(W)$ with the first initialization of the parameters and the first sample of the train data.

d) In back-propagation I update the parameters for $l=L,…1$ according to a learning rate $\alpha$:
$$ W^{[l]} = W^{[l]} - \alpha \, \frac{\partial J^{(1)}(W)}{\partial W^{[l]}}$$
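For concreteness, this single update could be written as the following sketch (plain NumPy; the dictionaries W and grads and the function name are hypothetical, just mirroring the notation above):

import numpy as np

def update_parameters(W, grads, alpha):
    """One descent step: W[l] <- W[l] - alpha * dJ/dW[l] for every layer l."""
    for l in W:                               # iterate over layer indices 1..L
        W[l] = W[l] - alpha * grads[l]
    return W

# toy usage with two layers (shapes follow the (n^[l] x n^[l-1]) convention)
W = {1: np.random.randn(4, 3), 2: np.random.randn(1, 4)}
grads = {1: np.random.randn(4, 3), 2: np.random.randn(1, 4)}
W = update_parameters(W, grads, alpha=0.01)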

Now I've done one step of the gradient descent with one sample of the train data. How does the algorithm continue?

Does it repeat step (c) and (d) with the "new" $W^{[l]}$ on a second sample of the train data $(X^{(2)},y^{(2)})$?

If so, does it continue until convergence, with every gradient-descent update performed on a different sample of the training data?

Please let me know if something is not clear.

Best Answer

TL;DR

Yes, you are correct.


LONG ANSWER

I'll give you a bit of context.

There are three main kinds of Gradient Descent:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent

These algorithms differ in the batch size they use.

Terminology

  • epoch: one complete pass of the whole dataset forward and backward through the learning algorithm
  • iterations: the number of batches needed to complete one epoch
  • batch size: the number of training instances in a single batch (one sample of the dataset)

Batch Gradient Descent

If your training data fits in memory (RAM / VRAM), the choice is Batch Gradient Descent. In this case the batch size is equal to the entire dataset, which means the model is updated only after the whole dataset has been passed through.

for epoch in number of epochs:
    - for all the training instances in the dataset compute the derivative of the cost function
    - update the weights
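As a concrete sketch of that loop (my own toy example, not part of the question: a linear model with a mean-squared-error cost, written in plain NumPy):

import numpy as np

def batch_gradient_descent(X, y, epochs=100, alpha=0.01):
    """X has shape (k x n), y has shape (1 x n), as in the question."""
    k, n = X.shape
    W = np.random.randn(1, k) * 0.01          # weights of a simple linear model
    for epoch in range(epochs):
        y_hat = W @ X                          # forward pass on the WHOLE dataset
        grad = (y_hat - y) @ X.T / n           # dJ/dW for J = ||y_hat - y||^2 / (2n)
        W = W - alpha * grad                   # exactly one update per epoch
    return W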

Stochastic Gradient Descent

Each of its gradients is a noisy estimate of the Batch Gradient Descent gradient. The batch size is equal to 1, which means the model is updated with a single training instance at a time.

for epoch in number of epochs:
    for instance in total dataset:
        - for the current instance compute the derivative of the cost function 
        - update the weights
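The same toy sketch with a batch size of 1: the only difference from the batch version above is the inner loop over shuffled single instances, with an update after each one (names are again hypothetical):

import numpy as np

def stochastic_gradient_descent(X, y, epochs=100, alpha=0.01):
    k, n = X.shape
    W = np.random.randn(1, k) * 0.01
    for epoch in range(epochs):
        for i in np.random.permutation(n):     # visit the instances in random order
            x_i = X[:, i:i + 1]                # one (k x 1) instance
            y_i = y[:, i:i + 1]                # its (1 x 1) target
            grad = (W @ x_i - y_i) @ x_i.T     # gradient on this single instance
            W = W - alpha * grad               # update after every instance
    return W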

Mini-batch Gradient Descent

It is a generalization of Stochastic Gradient Descent. The batch size is a value >= 1, which means the model is updated once per batch.

for epoch in number of epochs:
    for batch in num of batches:
        - for all the training instances in the batch sample compute the derivative of the cost function
        - update the weights
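And the mini-batch version of the same sketch, which generalizes both of the above: batch_size=1 recovers stochastic gradient descent and batch_size=n recovers batch gradient descent (the model and names are still my own, purely illustrative):

import numpy as np

def minibatch_gradient_descent(X, y, batch_size=100, epochs=100, alpha=0.01):
    k, n = X.shape
    W = np.random.randn(1, k) * 0.01
    for epoch in range(epochs):
        perm = np.random.permutation(n)        # reshuffle the dataset every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[:, idx], y[:, idx]      # one mini-batch
            grad = (W @ Xb - yb) @ Xb.T / len(idx)
            W = W - alpha * grad               # one update per mini-batch
    return W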

Example: to make this more concrete, suppose we have a dataset of 1000 instances (n_of_instances), a fixed number of epochs (n_of_epochs) equal to 100 for every kind of gradient descent, and a batch size of 100 (batch_size) for mini-batch gradient descent, which gives 10 iterations per epoch (n_of_iterations = n_of_instances / batch_size = 1000 / 100 = 10). A short check in code follows the list below.

  • Batch Gradient Descent: the model will be updated 100 times (n_of_epochs)
  • Stochastic Gradient Descent: the model will be updated 100,000 times (n_of_epochs * n_of_instances = 100 * 1000)
  • Mini-batch Gradient Descent: the model will be updated 1,000 times (n_of_iterations * n_of_epochs = 10 * 100)
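A quick way to verify those counts (a trivial sketch reusing the same variable names):

n_of_instances, n_of_epochs, batch_size = 1000, 100, 100
n_of_iterations = n_of_instances // batch_size       # 1000 / 100 = 10 batches per epoch

print(n_of_epochs)                      # batch GD:       100 updates
print(n_of_epochs * n_of_instances)     # SGD:            100,000 updates
print(n_of_epochs * n_of_iterations)    # mini-batch GD:  1,000 updates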

The rule of thumb is to use batch gradient descent if you can fit the whole dataset in memory. Otherwise, depending on the instance size, the choice will be mini-batch gradient descent with a fixed batch size that fits entirely in memory. Usually the error convergence of mini-batch gradient descent is noisier than that of batch gradient descent, because of the variability in the content of the batches.

