I would like to understand the steps of mini-batch gradient descent for training a neural network.
My training data $(X,y)$ have dimensions $(k \times n)$ and $(1 \times n)$, where $k$ is the number of features and $n$ is the number of observations.
For each layer $l=1,\dots,L$ my parameters are $W^{[l]}$ of dimension $(n^{[l]} \times n^{[l-1]})$, where $n^{[0]}=k$.
a) First I randomly initialize the parameters $W^{[l]}$ for $l=1,\dots,L$.
b) I take a sample of length $p\in[1,n]$ of my training data, denoted by $(X^{(1)},y^{(1)})$ for sample number $1$.
c) I compute the cost $J^{(1)}(W)$ with the first initialization of the parameters and the first sample of the training data.
d) In back-propagation I update the parameters for $l=L,\dots,1$ according to a learning rate $\alpha$:
$$ W^{[l]} = W^{[l]} - \alpha \, \frac{\partial J^{(1)}(W)}{\partial W^{[l]}}$$
Now I've done one step of gradient descent with one sample of the training data. How does the algorithm continue?
Does it repeat steps (c) and (d) with the "new" $W^{[l]}$ on a second sample of the training data $(X^{(2)},y^{(2)})$?
In that case, does it continue until convergence, with each gradient descent update computed on a different sample of the training data?
Please let me know if something is not clear.
Best Answer
TL;DR
Yes, you are correct.
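A minimal sketch of the loop you describe, in your notation; the helpers `forward_cost` and `backward_grads` are hypothetical stand-ins for your own forward pass and back-propagation:

```python
import numpy as np

def minibatch_gd(X, y, layer_sizes, forward_cost, backward_grads,
                 alpha=0.01, batch_size=32, n_epochs=100):
    # X has shape (k x n), y has shape (1 x n), as in the question.
    # forward_cost(W, X_b, y_b) -> J  and  backward_grads(W, X_b, y_b) -> [dJ/dW^[l], ...]
    # are hypothetical callables standing in for your forward pass and back-propagation.
    k, n = X.shape
    sizes = [k] + list(layer_sizes)                          # n^[0] = k
    rng = np.random.default_rng(0)

    # (a) random initialization of W^[l] with shape (n^[l] x n^[l-1])
    W = [0.01 * rng.standard_normal((sizes[l], sizes[l - 1]))
         for l in range(1, len(sizes))]

    history = []
    for epoch in range(n_epochs):
        perm = rng.permutation(n)                            # reshuffle observations each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            X_b, y_b = X[:, idx], y[:, idx]                  # (b) next mini-batch (X^(i), y^(i))
            history.append(forward_cost(W, X_b, y_b))        # (c) cost J on this mini-batch
            grads = backward_grads(W, X_b, y_b)              # (d) dJ/dW^[l] for l = L, ..., 1
            W = [W_l - alpha * g_l for W_l, g_l in zip(W, grads)]  # update every layer
    return W, history
```

So yes: after each update the "new" $W^{[l]}$ are used on the next mini-batch, and once all mini-batches have been seen (one epoch) the data are reshuffled and the process repeats until convergence or a fixed number of epochs.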
LONG ANSWER
I'll give you a bit of context.
There are three main kinds of Gradient Descent: Batch, Stochastic, and Mini-batch.
These algorithms differ in the size of the batch used for each update.
Terminology
Batch Gradient Descent
If you are working with training data that fits in memory (RAM / VRAM), Batch Gradient Descent is the natural choice. In this case the batch size is the entire dataset, which means the model is updated only after the whole dataset has been passed through (one update per epoch).
Stochastic Gradient Descent
It is a stochastic approximation of Batch Gradient Descent. The batch size is equal to 1, so the model is updated with a single training instance at a time.
Mini-batch Gradient Descent
It is a generalization of Stochastic Gradient Descent. The batch size is a value between 1 and the dataset size, and the model is updated once per batch (see the snippet after this paragraph).
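Reusing the hypothetical `minibatch_gd` sketch above, the three variants differ only in the `batch_size` argument (here `n` is the number of observations and `fc`, `bg` are the hypothetical cost/gradient helpers):

```python
# Boundary cases of the same loop (all names reused from the sketch above are hypothetical):
W_batch, _ = minibatch_gd(X, y, layer_sizes, fc, bg, batch_size=n)   # Batch GD: one update per epoch
W_sgd, _   = minibatch_gd(X, y, layer_sizes, fc, bg, batch_size=1)   # Stochastic GD: one update per instance
W_mini, _  = minibatch_gd(X, y, layer_sizes, fc, bg, batch_size=64)  # Mini-batch GD: 1 < batch_size < n
```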
Example: Just to be clearer, let's suppose we have a dataset of 1000 instances (n_of_instances), a fixed number of epochs (n_of_epochs) equal to 100 for every kind of gradient descent, and a batch size (batch_size) of 100 for mini-batch gradient descent. That gives 10 iterations per epoch (n_of_iterations = n_of_instances / batch_size = 1000 / 100 = 10).
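The same bookkeeping in code, using the figures above (the variable names are just illustrative):

```python
n_of_instances = 1000
n_of_epochs = 100
batch_size = 100

n_of_iterations = n_of_instances // batch_size   # updates per epoch: 1000 / 100 = 10
total_updates = n_of_iterations * n_of_epochs    # 10 * 100 = 1000 updates over the whole run

# With the same 100 epochs, for comparison:
#   Batch GD      (batch_size = 1000): 1 update per epoch     ->    100 updates in total
#   Stochastic GD (batch_size = 1):    1000 updates per epoch -> 100000 updates in total
print(n_of_iterations, total_updates)            # 10 1000
```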
The rule of thumb is to use Batch Gradient Descent if the whole dataset fits in memory. Otherwise, depending on the instance size, the choice is Mini-batch Gradient Descent with a fixed batch size that fits entirely in memory. With Mini-batch Gradient Descent the error convergence is usually noisier than with Batch Gradient Descent, because the content of each batch varies.