Solved – For gradient descent, does there always exist a step size such that the training error you are trying to minimize never increases?

gradient descent

Consider running gradient descent to minimize the ERM/training error (empirical risk), which I denote by $J(\theta, X, y)$ for data $X, y$ and parameter $\theta$. Recall the gradient descent update:

$$ \theta^{(t)} := \theta^{(t-1)} - \gamma_{t-1} \frac{\partial J(\theta^{(t-1)}, X, y)}{\partial \theta^{(t-1)}} $$

where $t$ denotes the iteration.
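For concreteness, here is a minimal sketch of this update in code, assuming a least-squares objective $J(\theta, X, y) = \frac{1}{2n}\|X\theta - y\|^2$; the objective and the function names are illustrative choices, not part of the question.

```python
# Minimal sketch of the gradient descent update above, assuming (as an
# illustration) the least-squares objective J(theta) = ||X theta - y||^2 / (2n).
import numpy as np

def grad_J(theta, X, y):
    """Gradient of the illustrative least-squares objective w.r.t. theta."""
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def gradient_descent(theta0, X, y, gamma, n_iters=100):
    """Repeatedly apply theta <- theta - gamma * grad J(theta, X, y)."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - gamma * grad_J(theta, X, y)
    return theta
```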

My question is: does there always exist a step size $\gamma_{t-1}$ (whether it is fixed or a function of the iteration $t$) such that gradient descent never updates the parameters in a way that increases the training error $J$?

My intuition tells me that if $\gamma_{t-1}$ is sufficiently small, gradient descent should (provably) never increase the cost of the function being optimized. The reason I think this is that with a really large step size $\gamma_{t-1}$ we would bounce around the minimum, potentially overshooting it and increasing the cost. So my conjecture is that yes, such a step size $\gamma_{t-1}$ should exist, but I wanted to confirm this with the community, or see a link to a proof of this claim (if it is true and gradient descent is so widely used, this question has probably already been studied), or perhaps have someone provide a proof if they know one.
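To make this intuition concrete, here is a quick numerical check on a toy 1-D quadratic $J(\theta) = \theta^2$ (a made-up example, not the general ERM setting): a small step size decreases the cost at every iteration, while a too-large one makes it grow.

```python
# Toy check of the overshooting intuition on J(theta) = theta^2,
# whose gradient is 2 * theta (illustrative example only).
def J(theta):
    return theta ** 2

def step(theta, gamma):
    return theta - gamma * 2 * theta   # gradient descent update

for gamma in (0.01, 1.5):              # small vs. too-large step size
    theta, costs = 3.0, []
    for _ in range(5):
        costs.append(J(theta))
        theta = step(theta, gamma)
    print(gamma, costs)                # gamma=0.01: decreasing; gamma=1.5: blowing up
```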

Best Answer

Assuming the error function is twice continuously differentiable and coercive (the error goes to infinity as the parameters $\theta$ go to infinity, e.g. if you use L1/L2 norm regularisation on the parameters), then I believe the answer is yes, and I will sketch a proof, which hopefully identifies the key concepts involved.

I will drop $X, y$ to simplify the notation.

Coercivity basically allows you to restrict attention to a finite region of the parameter space (in mathematical terms, a compact set): compute the error at $\theta = 0$; then you are only interested in the $\theta$ whose regularisation term is at most that error. For example, if $J(\theta) = \mathrm{error}(\theta) + \alpha \|\theta\|^2$, then we are only interested in $\|\theta\|^2 \le \mathrm{error}(0)/\alpha =: K$, since outside this region the regularisation term alone already makes $J(\theta)$ exceed $J(0)$.
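As a toy numerical illustration of that bound (the error function and $\alpha$ below are made up, assuming only that the error term is non-negative):

```python
# Toy illustration of the compactness bound K = error(0) / alpha for
# J(theta) = error(theta) + alpha * ||theta||^2 (made-up error and alpha).
import numpy as np

alpha = 0.1

def error(theta):
    return np.sum((theta - 1.0) ** 2)         # hypothetical non-negative error

def J(theta):
    return error(theta) + alpha * np.sum(theta ** 2)

K = error(np.zeros(3)) / alpha                # K = error(0) / alpha
theta_far = np.full(3, np.sqrt(K / 3) + 1.0)  # a point with ||theta||^2 > K
print(K, J(np.zeros(3)), J(theta_far))        # J(theta_far) > J(0): the penalty
                                              # term alone already exceeds error(0)
```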

Applying Taylor's theorem with remainder to the change $J(\theta^{(t)}) - J(\theta^{(t-1)})$ with step size $\gamma$:

$J(\theta^{(t)}) - J(\theta^{(t-1)}) = -\gamma \nabla J(\theta^{(t-1)})\cdot \nabla J(\theta^{(t-1)}) + \frac{\gamma^2}{2} \nabla J(\theta^{(t-1)})^T H\big((1-\eta)\theta^{(t-1)}+\eta \theta^{(t)}\big) \nabla J(\theta^{(t-1)})$

Here $H$ is the Hessian (the matrix of second derivatives with respect to $\theta$) and $\eta$ is an unknown number strictly between 0 and 1 given to us by Taylor's theorem.

So, bounding the quadratic term by $\frac{\gamma^2}{2}\,\sigma(H)\,\|\nabla J(\theta^{(t-1)})\|^2$, to ensure this change is negative (or zero when the gradient vanishes) we require $\gamma < \frac{2}{\max_{\|\theta\|^2\le K}\sigma(H)}$,

where $\sigma(H)$ is the maximum eigenvalue of $H(\theta)$.

Because $\{\|\theta\|^2\le K\}$ is a compact set and $\sigma(H)$ is a continuous function of $\theta$, a (finite) maximum exists over this region (the extreme value theorem), and so we can find a correspondingly small enough step size.
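As a sanity check on this conclusion, here is a small numerical experiment on a convex quadratic, where the Hessian is a constant matrix and the bound is easy to compute; the matrix, the starting point, and the factor 1.9 are arbitrary illustrative choices.

```python
# Verify numerically that gamma < 2 / sigma_max(H) gives a monotone decrease
# of J on a convex quadratic J(theta) = 0.5 * theta^T A theta (Hessian H = A).
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)                  # symmetric positive definite Hessian

def J(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

sigma_max = np.linalg.eigvalsh(A).max()  # largest eigenvalue of the Hessian
gamma = 1.9 / sigma_max                  # any gamma < 2 / sigma_max works

theta = rng.standard_normal(3)
costs = [J(theta)]
for _ in range(50):
    theta = theta - gamma * grad(theta)
    costs.append(J(theta))

assert all(b <= a for a, b in zip(costs, costs[1:]))  # cost never increases
print(costs[0], costs[-1])
```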