What happens if a convex objective is optimized by stochastic gradient descent?
Is a global solution achieved?
[Math] Stochastic gradient descent for convex optimization
Tags: convex-optimization, convex-analysis, gradient-descent, machine-learning, numerical-methods
Related Solutions
Suppose we have the following objective function
$$f (x) := \frac{1}{2} x^T Q x - r^T x$$
where $Q \in \mathbb R^{n \times n}$ is symmetric and positive definite and $r \in \mathbb R^n$. From the symmetry of $Q$, we conclude that its eigenvalues are real. From the positive definiteness of $Q$, we conclude that its eigenvalues are positive and that $f$ is a convex function.
Taking the gradient of $f$,
$$\nabla f (x) = Q x - r$$
The gradient vanishes at $\bar{x} := Q^{-1} r$, which is the unique solution to the linear system $Q x = r$.
Doing gradient descent,
$$\begin{array}{rl} x_{k+1} &= x_k - \alpha \nabla f (x_k)\\\\ &= x_k - \alpha Q x_k + \alpha r\\\\ &= (I_n - \alpha Q) x_k + \alpha r\end{array}$$
Since $Q \bar{x} = r$, the minimizer is a fixed point of this iteration: $\bar{x} = (I_n - \alpha Q) \bar{x} + \alpha r$. Subtracting, $x_{k+1} - \bar{x} = (I_n - \alpha Q) (x_k - \bar{x})$, and iterating yields
$$x_k = \bar{x} + (I_n - \alpha Q)^k (x_0 - \bar{x})$$
Since $Q$ is symmetric, $(I_n - \alpha Q)^k \to O_n$ if and only if every eigenvalue of $I_n - \alpha Q$ has modulus less than $1$. Hence, convergence to $\bar{x}$ from every starting point $x_0$ happens when
$$|1 - \alpha \lambda_i (Q)| < 1$$
for all $i \in \{1,2,\dots,n\}$, i.e., when
$$0 < \alpha < \frac{2}{\lambda_{\max} (Q)} =: \alpha_{\max}$$
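As a quick numerical sanity check, here is a minimal NumPy sketch (the random instance and all variable names are my own illustration, not part of the derivation) that runs the iteration above with a step size just below $\alpha_{\max}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)                   # symmetric positive definite
r = rng.standard_normal(n)

x_bar = np.linalg.solve(Q, r)                 # unique minimizer Q^{-1} r
alpha_max = 2.0 / np.linalg.eigvalsh(Q)[-1]   # 2 / lambda_max(Q)
alpha = 0.9 * alpha_max                       # any alpha in (0, alpha_max) works

x = np.zeros(n)
for k in range(500):
    x = x - alpha * (Q @ x - r)               # x_{k+1} = x_k - alpha * grad f(x_k)

print(np.linalg.norm(x - x_bar))              # ~0: x_k has converged to Q^{-1} r
```

With $\alpha$ slightly above $\alpha_{\max}$ the same loop typically diverges, matching the eigenvalue condition.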
Evaluating the gradient,
$$\begin{array}{rl} \nabla f (x_{k+1}) &= Q x_{k+1} - r\\\\ &= Q (I_n - \alpha Q) x_k + \alpha Q r - r\\\\ &= (I_n - \alpha Q) (Q x_k - r)\\\\ &= (I_n - \alpha Q) \nabla f (x_k)\end{array}$$
Thus,
$$\|\nabla f (x_{k+1})\|_2 \leq \|I_n - \alpha Q\|_2 \|\nabla f (x_k)\|_2$$
If $0 < \alpha < \alpha_{\max}$, then $|1 - \alpha \lambda_i (Q)| < 1$ for all $i$ and, since $Q$ is symmetric, $\|I_n - \alpha Q\|_2 = \max_i |1 - \alpha \lambda_i (Q)| < 1$. Thus, whenever the step size is chosen in the convergent range, the $2$-norm of the gradient contracts with each iteration:
$$\|\nabla f (x_{k+1})\|_2 < \|\nabla f (x_k)\|_2$$
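Continuing the sketch above, the contraction of the gradient norm can be checked directly:

```python
# Track the gradient norm along the same iteration.
x = np.zeros(n)
norms = []
for k in range(50):
    g = Q @ x - r                      # grad f(x_k)
    norms.append(np.linalg.norm(g))
    x = x - alpha * g

# Each step shrinks the norm by at least the factor ||I - alpha*Q||_2 < 1.
assert all(b < a for a, b in zip(norms, norms[1:]))
```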
I think that "batch gradient descent" is just another name for "gradient descent". And "mini-batch gradient descent" is just another name for the mini-batch version of stochastic gradient descent (which I would call mini-batch SGD).
So below I'll explain the meaning of gradient descent, stochastic gradient descent (with a mini-batch size of $1$), and mini-batch stochastic gradient descent.
Suppose we're minimizing $$ \tag{1} f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x). $$ Let's assume the functions $f_i: \mathbb R^n \to \mathbb R$ are differentiable. The gradient descent iteration is $$ x^{k+1} = x^k - t \nabla f(x^k). $$ Here $t$ is the step size, also known as the learning rate, for our optimization algorithm.
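As a concrete illustration (the least-squares choice $f_i(x) = \tfrac{1}{2} (a_i^\top x - b_i)^2$ and all names below are my own, not from the question), the gradient descent iteration looks like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 200, 10
A = rng.standard_normal((N, n))   # row i is a_i
b = rng.standard_normal(N)        # target b_i

def grad_f(x):
    # gradient of f(x) = (1/N) * sum_i 0.5 * (a_i^T x - b_i)^2
    return A.T @ (A @ x - b) / N

t = 0.1                           # step size / learning rate
x = np.zeros(n)
for k in range(1000):
    x = x - t * grad_f(x)         # full-gradient update

print(np.linalg.norm(grad_f(x)))  # ~0 at the least-squares minimizer
```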
In stochastic gradient descent (with a mini-batch size of $1$), each time we update $x$ we compute the gradient using only one of the terms, selected at random, from the big sum (1). So in SGD we update $x$ as follows: $$ x^{k+1} = x^k - t \nabla f_i(x^k), $$ where $i$ is selected at random from $\{1,2, \ldots, N \}$. The index $i$ can be selected either with replacement or without replacement. I believe without replacement is more common and tends to work a little better.
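For the same least-squares instance, the single-sample SGD update becomes (sampling with replacement here for brevity; the without-replacement, epoch-based variant is sketched after the next paragraph):

```python
# SGD with mini-batch size 1: one randomly chosen term per update.
x = np.zeros(n)
for k in range(5000):
    i = rng.integers(N)                 # pick i uniformly from {0, ..., N-1}
    g_i = (A[i] @ x - b[i]) * A[i]      # grad f_i(x) = (a_i^T x - b_i) a_i
    x = x - t * g_i                     # with fixed t, x hovers near the minimizer
```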
In the mini-batch version of stochastic gradient descent, with a mini-batch size of $M$, each time we update $x$ we compute the gradient using only $M$ of the terms, selected at random, from the big sum (1). So the mini-batch SGD update is $$ x^{k+1} = x^k - t \left( \frac{1}{M} \sum_{i \in S_k} \nabla f_i(x^k) \right), $$ where $S_k$ is a randomly selected $M$-element subset of $\{1, 2, \ldots, N \}$. Usually the subset $S_k$ is not allowed to intersect any of the previous subsets $S_1, \ldots, S_{k-1}$ until we have completely exhausted the set of possible indices $S = \{1,2, \ldots, N \}$, at which point we make another full sweep through $S$, then another, and so on. Each full sweep through $S$ is called one "epoch".
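A hedged sketch of that epoch-based scheme, continuing the least-squares example (`M` and the per-epoch shuffle are the only new ingredients):

```python
# Mini-batch SGD: shuffle the indices once per epoch, then sweep through
# disjoint batches of size M until the index set is exhausted.
M = 20
x = np.zeros(n)
for epoch in range(50):
    perm = rng.permutation(N)
    for start in range(0, N, M):
        S = perm[start:start + M]             # current mini-batch S_k
        g = A[S].T @ (A[S] @ x - b[S]) / M    # (1/M) * sum_{i in S_k} grad f_i(x)
        x = x - t * g
```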
When using stochastic gradient descent, people often use a diminishing step size strategy in order to guarantee convergence. SGD with a fixed step size does not converge to the minimizer (the iterates keep fluctuating in a neighborhood of it), although in practice it might find a good approximate minimizer of (1).
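One common diminishing schedule is $t_k = t_0 / (1 + k)$, which satisfies the usual conditions $\sum_k t_k = \infty$ and $\sum_k t_k^2 < \infty$; a minimal variant of the SGD loop above (again my own illustration):

```python
# SGD with a diminishing step size t_k = t0 / (1 + k); unlike a fixed step,
# this schedule lets the iterates settle down rather than fluctuate forever.
t0 = 0.1
x = np.zeros(n)
for k in range(20000):
    i = rng.integers(N)
    g_i = (A[i] @ x - b[i]) * A[i]
    x = x - (t0 / (1 + k)) * g_i
```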
Best Answer
Eventually :). A convex objective function has no suboptimal local minima, so, for every point in the domain, the integral curve of the gradient vector field from that point leads to the global minimum, and SGD approximately follows the integral curve as long as the learning rate is small enough. It is not even necessary to track the integral curve very closely: since a gradient integral curve leads to the global minimum from wherever you land, you just need to keep moving downhill.

The real issue is the rate of progress, which can easily stall in regions with small gradient. Avoiding this also depends on the learning rate, making the learning rate critical to the practical use of SGD. I believe there are theoretical guarantees of convergence if you vary the learning rate inversely with time. The best place to look for specific results is in Leon Bottou's papers and the Pegasos paper.

The current focus of research seems to be on varying the learning rate adaptively during training, based on the learning progress thus far. An early version of this was the passive-aggressive perceptron. More recent methods like "natural gradient" and the AROW algorithm adaptively maintain a separate learning rate for each component of the gradient.