Solved – How to set the step size for stochastic gradient descent so that it provably converges

gradient descent, machine learning, regression

Recall stochastic gradient descent (for regression):

$$\theta^{(0)} = 0$$

For $k = 0, 1, 2, \dots$: randomly select $t \in \{1, \dots, n\}$ and update

$$\theta^{(k+1)} = \theta^{(k)} + \eta_{k}\left(y^{(t)} - \theta^{(k)} \cdot x^{(t)}\right)x^{(t)}$$

I was following some notes, and they said that for stochastic gradient descent to converge it is sufficient to set the learning rate to:

$$\eta_{k} = \frac{1}{k+1}$$
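For concreteness, here is a minimal NumPy sketch of the recursion above with that step size; the synthetic data, problem size, and iteration count are placeholders I picked arbitrarily:

```python
import numpy as np

# Minimal sketch of the recursion above for least-squares regression with
# eta_k = 1/(k+1); the synthetic data and iteration count are placeholders.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

theta = np.zeros(d)                      # theta^(0) = 0
for k in range(50_000):
    t = rng.integers(n)                  # randomly select t in {1, ..., n}
    eta_k = 1.0 / (k + 1)
    theta = theta + eta_k * (y[t] - theta @ X[t]) * X[t]

print(theta)  # should end up near theta_true
```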

I normally post my own attempts with a question, but I honestly don't know how to prove that this choice makes stochastic gradient descent converge. If someone knows of a proof, that would be awesome!

An answer that contains a proof for some special case, say for convex functions or for linear regression, but that provides some insight (or a proof) into why this result holds across different settings of stochastic gradient descent, would also be a good answer!

I am aware that finding global minima is hard in general, so restricting your answer is fine as long as it contains a good proof/argument and/or an example from machine learning.

Also, if you do not remember the proof but have a good intuition for why that value is a good choice for stochastic gradient descent, that would also be a very useful response! 🙂

Thanks in advance!

Best Answer

First of all, you won't find a proof of this in the general case. Proofs of convergence in batch/stochastic gradient descent algorithms rely on convexity/strong convexity hypotheses.

In the case of stochastic gradient descent, one theorem states that if the objective function is convex and the learning rate $\eta_t$ is such that

$$\sum_t \eta_t = +\infty \quad \text{and} \quad \sum_t \eta_t^2 < + \infty$$

then stochastic gradient descent converges almost surely to the global minimum (by the Robbins–Siegmund theorem, if I recall correctly). The proof is nontrivial and makes use of results from the theory of stochastic processes and martingales, as is the case for most convergence results for SGD.
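For the step size in the question, both conditions are easy to check directly: the harmonic series diverges, while the sum of inverse squares converges,

$$\sum_{k \ge 0} \frac{1}{k+1} = \sum_{m \ge 1} \frac{1}{m} = +\infty, \qquad \sum_{k \ge 0} \frac{1}{(k+1)^2} = \frac{\pi^2}{6} < +\infty.$$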

As the check above shows, your step size satisfies both conditions, although in practice one typically chooses a step of the form

$$\frac{\sigma}{(1 + \sigma \lambda t)^{3/4}}$$

where $\sigma$ is the initial learning rate and $\lambda$ governs the asymptotic convergence speed.
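As a rough sanity check (the helper name and the default values for $\sigma$ and $\lambda$ below are my own, not from the answer), this schedule also satisfies both conditions, since the exponent $3/4 < 1$ while $2 \cdot 3/4 = 3/2 > 1$:

```python
import numpy as np

def step_size(t, sigma=0.1, lam=1e-2):
    """Step size sigma / (1 + sigma * lam * t)**(3/4) from the answer above.

    sigma is the initial learning rate and lam governs the decay;
    the default values here are arbitrary placeholders.
    """
    return sigma / (1.0 + sigma * lam * t) ** 0.75

# Partial sums over a finite horizon: the first grows without bound as the
# horizon is extended (exponent 3/4 < 1), the second converges (3/2 > 1).
ts = np.arange(1, 100_001)
etas = step_size(ts)
print(etas.sum(), (etas ** 2).sum())
```

Such a schedule can be dropped into the SGD loop from the question in place of $1/(k+1)$.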