Gradient Descent and its Variants

gradient descent, machine learning, numerical optimization, optimization

I know there is a lot of material on this topic on the internet, and trust me, I've googled it. But things are only getting more and more confusing for me.

From my understanding, Optimization refers to the task of minimizing/maximizing an objective function $f(x)$ parameterized by $x$. In machine/deep learning terminology, it's the task of minimizing the cost/loss function $J(w)$ parameterized by the model's parameters $w \in \mathbb{R}^d$.

Gradient Descent is the most common optimization algorithm in machine learning and deep learning. It is a first-order optimization algorithm. This means it only takes into account the first derivative when performing the updates on the parameters.

Variants of Gradient Descent:
Batch Gradient Descent,
Mini-batch Gradient Descent, and
Stochastic Gradient Descent.

Could anyone explain in simple words (and maybe with an example or the math behind them) how Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent work, and the differences between them?

Best Answer

I think that "batch gradient descent" is just another name for "gradient descent". And "mini-batch gradient descent" is just another name for the mini-batch version of stochastic gradient descent (which I would call mini-batch SGD).

So below I'll explain the meaning of gradient descent, stochastic gradient descent (with a mini-batch size of $1$), and mini-batch stochastic gradient descent.


Suppose we're minimizing $$ \tag{1} f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x). $$ Let's assume the functions $f_i: \mathbb R^n \to \mathbb R$ are differentiable. The gradient descent iteration is $$ x^{k+1} = x^k - t \nabla f(x^k). $$ Here $t$ is the step size, also known as the learning rate, for our optimization algorithm.
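To make this concrete, here is a minimal NumPy sketch of the full (batch) gradient descent update applied to a least-squares objective, i.e. $f_i(x) = (a_i^\top x - b_i)^2$. The data `A`, `b`, the step size, and the iteration count are illustrative assumptions, not anything specified above.

```python
import numpy as np

# Illustrative setup: f(x) = (1/N) * sum_i (a_i^T x - b_i)^2,
# so grad f_i(x) = 2 * (a_i^T x - b_i) * a_i.
rng = np.random.default_rng(0)
N, n = 100, 5
A = rng.normal(size=(N, n))          # rows a_i
b = rng.normal(size=N)               # targets b_i

def full_gradient(x):
    # Average of the per-term gradients: (1/N) * sum_i grad f_i(x)
    residuals = A @ x - b
    return 2.0 * A.T @ residuals / N

x = np.zeros(n)
t = 0.05                              # step size (learning rate), chosen arbitrarily
for k in range(500):
    x = x - t * full_gradient(x)      # x^{k+1} = x^k - t * grad f(x^k)
```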

In stochastic gradient descent (with a mini-batch size of $1$), each time we update $x$ we compute the gradient using only one of the terms, selected at random, from the big sum (1). So in SGD we update $x$ as follows: $$ x^{k+1} = x^k - t \nabla f_i(x^k), $$ where $i$ is selected at random from $\{1,2, \ldots, N \}$. The index $i$ can be selected either with replacement or without replacement. I believe without replacement is more common and tends to work a little better.
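Continuing the same illustrative least-squares setup (reusing `A`, `b`, `rng`, `N`, and `n` from the sketch above), a batch-size-$1$ SGD loop that samples without replacement by shuffling the indices each pass might look like this:

```python
def grad_fi(x, i):
    # Gradient of a single term f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

x = np.zeros(n)
t = 0.01                              # fixed step size, illustrative
for epoch in range(20):
    order = rng.permutation(N)        # "without replacement": shuffle, then sweep once
    for i in order:
        x = x - t * grad_fi(x, i)     # x^{k+1} = x^k - t * grad f_i(x^k)
```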

In the mini-batch version of stochastic gradient descent, with a mini-batch size of $M$, each time we update $x$ we compute the gradient using only $M$ of the terms, selected at random, from the big sum (1). So the mini-batch SGD update is $$ x^{k+1} = x^k - t \left( \frac{1}{M} \sum_{i \in S_k} \nabla f_i(x^k) \right), $$ where $S_k$ is a randomly selected $M$-element subset of $\{1, 2, \ldots, N \}$. Usually the subset $S_k$ is not allowed to intersect any of the previous subsets $S_1, \ldots, S_{k-1}$ until we have completely exhausted the set of possible indices $S = \{1,2, \ldots, N \}$, at which point we make another full sweep through $S$, then another full sweep through $S$, and so on. Each full sweep through $S$ is called one "epoch".
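Under the same assumed least-squares setup, the mini-batch variant sweeps through disjoint mini-batches of a shuffled index set, so each epoch uses every $f_i$ exactly once; the choice $M = 10$ below is arbitrary.

```python
M = 10                                # mini-batch size, illustrative
x = np.zeros(n)
t = 0.05
for epoch in range(20):
    order = rng.permutation(N)        # one epoch = one full sweep through shuffled indices
    for start in range(0, N, M):
        S_k = order[start:start + M]  # disjoint mini-batches within the epoch
        residuals = A[S_k] @ x - b[S_k]
        grad = 2.0 * A[S_k].T @ residuals / len(S_k)   # (1/M) * sum_{i in S_k} grad f_i(x)
        x = x - t * grad
```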

When using stochastic gradient descent, often people use a diminishing step size strategy in order to guarantee convergence. SGD with a fixed step size does not converge, although in practice it might find a good approximate minimizer of (1).
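One common diminishing schedule (an assumption here, not the only valid choice) is $t_k = t_0 / (1 + k)$, which shrinks to zero while its sum still diverges. A sketch, reusing `grad_fi` and `rng` from above:

```python
t0 = 0.1                              # initial step size, illustrative
x = np.zeros(n)
k = 0
for epoch in range(20):
    for i in rng.permutation(N):
        t = t0 / (1 + k)              # diminishing step size: t_k -> 0, sum_k t_k = infinity
        x = x - t * grad_fi(x, i)
        k += 1
```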
