Solved – For gradient descent, does there always exist a step size such that the training error you are trying to minimize never increases?

gradient descent

Consider running gradient descent to minimize the ERM/training error (empirical risk), which I denote by $J(\theta, X, y)$ for data $X, y$ and parameter $\theta$. Recall the gradient descent update:

$$ \theta^{(t)} := \theta^{(t-1)} - \gamma_{t-1} \frac{\partial J(\theta^{(t-1)}, X, y)}{\partial \theta^{(t-1)}} $$

where $t$ denotes the iteration.
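For concreteness, here is a minimal sketch of this update in code, assuming a least-squares objective $J(\theta, X, y) = \frac{1}{2n}\|X\theta - y\|^2$; the objective and the function names are illustrative choices, not part of the question.

```python
# Minimal sketch of the gradient descent update above, assuming (as an
# illustration) the least-squares objective J(theta) = ||X theta - y||^2 / (2n).
import numpy as np

def grad_J(theta, X, y):
    """Gradient of the illustrative least-squares objective w.r.t. theta."""
    n = X.shape[0]
    return X.T @ (X @ theta - y) / n

def gradient_descent(theta0, X, y, gamma, n_iters=100):
    """Repeatedly apply theta <- theta - gamma * grad J(theta, X, y)."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - gamma * grad_J(theta, X, y)
    return theta
```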

My question is: does there always exist a step size $\gamma_{t-1}$ (whether it is fixed or a function of the iteration $t$) such that gradient descent never updates the parameters in a way that increases the training error $J$?

My intuition tells me that if $\gamma_{t-1}$ is sufficiently small, gradient descent should (provably) never increase the cost of the function being optimized. The reason I think this is that with a really large step size $\gamma_{t-1}$ we would bounce around the minimum, potentially overshooting it and increasing the cost. So my conjecture is that yes, such a step size $\gamma_{t-1}$ should exist, but I wanted to confirm this with the community, or see a link to a proof of this claim (if it is true and gradient descent is so widely used, this question has probably already been studied), or perhaps have someone provide a proof if they know one.
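To make this intuition concrete, here is a quick numerical check on a toy 1-D quadratic $J(\theta) = \theta^2$ (a made-up example, not the general ERM setting): a small step size decreases the cost at every iteration, while a too-large one makes it grow.

```python
# Toy check of the overshooting intuition on J(theta) = theta^2,
# whose gradient is 2 * theta (illustrative example only).
def J(theta):
    return theta ** 2

def step(theta, gamma):
    return theta - gamma * 2 * theta   # gradient descent update

for gamma in (0.01, 1.5):              # small vs. too-large step size
    theta, costs = 3.0, []
    for _ in range(5):
        costs.append(J(theta))
        theta = step(theta, gamma)
    print(gamma, costs)                # gamma=0.01: decreasing; gamma=1.5: blowing up
```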

Best Answer

Assuming the error function is twice continuously differentiable and coercive (the error goes to infinity as the parameters $\theta$ go to infinity, e.g. if you use L1/L2 norm regularisation on the parameters), then I believe the answer is yes, and I will sketch a proof, which hopefully identifies the key concepts involved.

I will drop $X, y$ to simplify the notation.

Coercivity basically allows you to restrict attention to a finite region of the parameter space (in mathematical terms, a compact set): compute the error at $\theta = 0$; then you are only interested in the $\theta$ whose regularisation term is at most that error. For example, if $J(\theta) = \mathrm{error}(\theta) + \alpha \|\theta\|^2$, then we are only interested in $\|\theta\|^2 \le \mathrm{error}(0)/\alpha =: K$, since outside this region the regularisation term alone already makes $J(\theta)$ exceed $J(0)$.
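As a toy numerical illustration of that bound (the error function and $\alpha$ below are made up, assuming only that the error term is non-negative):

```python
# Toy illustration of the compactness bound K = error(0) / alpha for
# J(theta) = error(theta) + alpha * ||theta||^2 (made-up error and alpha).
import numpy as np

alpha = 0.1

def error(theta):
    return np.sum((theta - 1.0) ** 2)         # hypothetical non-negative error

def J(theta):
    return error(theta) + alpha * np.sum(theta ** 2)

K = error(np.zeros(3)) / alpha                # K = error(0) / alpha
theta_far = np.full(3, np.sqrt(K / 3) + 1.0)  # a point with ||theta||^2 > K
print(K, J(np.zeros(3)), J(theta_far))        # J(theta_far) > J(0): the penalty
                                              # term alone already exceeds error(0)
```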

Applying Taylor's theorem with remainder to the change $J(\theta^{(t)}) - J(\theta^{(t-1)})$ with step size $\gamma$:

$J(\theta^{(t)}) - J(\theta^{(t-1)}) = -\gamma \nabla J(\theta^{(t-1)})\cdot \nabla J(\theta^{(t-1)}) + \frac{\gamma^2}{2} \nabla J(\theta^{(t-1)})^T H\big((1-\eta)\theta^{(t-1)}+\eta \theta^{(t)}\big) \nabla J(\theta^{(t-1)})$

Here $H$ is the Hessian (the matrix of second derivatives with respect to $\theta$) and $\eta$ is an unknown number strictly between 0 and 1 given to us by Taylor's theorem.

So, bounding the quadratic term by $\frac{\gamma^2}{2}\,\sigma(H)\,\|\nabla J(\theta^{(t-1)})\|^2$, to ensure this change is negative (or zero when the gradient vanishes) we require $\gamma < \frac{2}{\max_{\|\theta\|^2\le K}\sigma(H)}$,

where $\sigma(H)$ is the maximum eigenvalue of $H(\theta)$.

Because $\{\|\theta\|^2\le K\}$ is a compact set and $\sigma(H)$ is a continuous function of $\theta$, a (finite) maximum exists over this region (the extreme value theorem), and so we can find a correspondingly small enough step size.
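As a sanity check on this conclusion, here is a small numerical experiment on a convex quadratic, where the Hessian is a constant matrix and the bound is easy to compute; the matrix, the starting point, and the factor 1.9 are arbitrary illustrative choices.

```python
# Verify numerically that gamma < 2 / sigma_max(H) gives a monotone decrease
# of J on a convex quadratic J(theta) = 0.5 * theta^T A theta (Hessian H = A).
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)                  # symmetric positive definite Hessian

def J(theta):
    return 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

sigma_max = np.linalg.eigvalsh(A).max()  # largest eigenvalue of the Hessian
gamma = 1.9 / sigma_max                  # any gamma < 2 / sigma_max works

theta = rng.standard_normal(3)
costs = [J(theta)]
for _ in range(50):
    theta = theta - gamma * grad(theta)
    costs.append(J(theta))

assert all(b <= a for a, b in zip(costs, costs[1:]))  # cost never increases
print(costs[0], costs[-1])
```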