Gradient Descent Optimization – Why Gradient Descent Explodes If Learning Rate is Too Large

computational-statistics, gradient descent, least squares, python

I've implemented my own gradient descent algorithm for OLS, code below. It works, but when the learning rate is too large (i.e. learn_rate >= .3), my approach is unstable: the coefficients explode and I get an overflow error. I understand that if my learning rate is too large I get bad results, since the algorithm takes steps that are too big and keeps missing the optimum. However, given that the OLS loss function is a convex optimization problem, I'm surprised that a large learning rate would cause explosive coefficient estimates. Any insight would be greatly appreciated (as well as coding suggestions, though I know this is not the right place for that kind of talk).

import numpy as np
y1 = np.array([0, 1, 2])
X1 = np.array([[0, 0], [1, 1], [2, 2]])

def gradient_descent(LHS, RHS, N, learn_rate, tol, params, max_iter):
    #Append intercept
    constant = np.ones((N, 1))
    RHS = np.hstack((constant, RHS))

    #Define Gradient - Using Matrix Notation
    def gradient(X, y, p):
        grad_part1 = -2*np.dot(np.transpose(X),y)
        grad_part2 = 2*np.dot(np.dot(np.transpose(X), X), p)
        return((1/len(y))*(grad_part1 + grad_part2))

    #Define Gradient update
    def param_update(p,lr,X,y):
        return(p - lr*gradient(X, y, p))

    #Check if we start at the optimum
    old_params = params
    params = param_update(params, learn_rate, RHS, LHS)

    if np.all(np.abs(params - old_params) <= tol):
        return(params)
    #If not, run gradient descent
    else:
        n_iter = 1
        while np.any(np.abs(params - old_params) > tol) and n_iter < max_iter:
            old_params = params
            params = param_update(params, learn_rate, RHS, LHS)
            n_iter += 1
        return([params, n_iter])

LHS = y1.reshape(len(y1),1)
RHS = X1

myres = gradient_descent(LHS, RHS, len(LHS), .1, .1, np.array([[1], [1], [1]]), 10000)  # learn_rate = .1: converges
myres = gradient_descent(LHS, RHS, len(LHS), .3, .1, np.array([[1], [1], [1]]), 10000)  # learn_rate = .3: coefficients explode (overflow)

Best Answer

The learning rate can be seen as a step size, $\eta$. As such, gradient descent takes successive steps in the direction of the minimum. If the step size $\eta$ is too large, it can "jump over" the minimum we are trying to reach, i.e. we overshoot. This can lead to oscillations around the minimum or, in some cases, to outright divergence. It is important to note that the step gradient descent takes is a function of the step size $\eta$ as well as of the gradient values $g$. If we are at a local minimum with zero gradient, the algorithm does not update the parameters $p$ at all because the gradient is zero; conversely, if $p$ sits on a "steep slope", even a small $\eta$ can lead to a large update of $p$'s values.
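To see concretely how an overshoot turns into divergence, consider a minimal one-dimensional example: the convex loss $f(p) = p^2$ with gradient $2p$. The gradient descent update is

$$p_{t+1} = p_t - \eta \cdot 2 p_t = (1 - 2\eta)\, p_t, \qquad \text{so} \qquad |p_t| = |1 - 2\eta|^t \, |p_0|.$$

The iterates converge only when $|1 - 2\eta| < 1$, i.e. $0 < \eta < 1$. For any larger $\eta$ every overshoot is bigger than the previous one and $p_t$ diverges geometrically, even though the problem is perfectly convex.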

In particular, for the case of divergence, what happens is that as soon as an oversized step is taken from an initial point $p_{i=0}$, the gradient descent algorithm lands at a point $p_{i=1}$ that is worse than $p_{i=0}$ in terms of cost. At this new but cost-wise worse point $p_{i=1}$, the recalculated gradient values are larger, so the next (hopefully corrective) step is even bigger. If that step leads to a point $p_{i=2}$ with an even larger error because we overshoot again, the gradient values grow yet again, ultimately leading to a vicious cycle of ever-increasing gradient values and "exploding coefficients" $p_i$.
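For your least squares loss $\frac{1}{n}\lVert y - Xp \rVert^2$ this cycle can be made precise. Writing $p^*$ for a minimiser, the iterates satisfy

$$p_{t+1} - p^* = \left(I - \eta \tfrac{2}{n} X^\top X\right)(p_t - p^*),$$

so the error component along each eigendirection of $\tfrac{2}{n} X^\top X$ is multiplied by $1 - \eta\lambda$ at every iteration, and it blows up as soon as $\eta \lambda_{\max} > 2$. For your toy data (with the intercept column appended), a quick eigenvalue calculation puts this threshold at roughly $\eta = 0.25$, which matches your observation that $0.1$ converges while $0.3$ explodes.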

In the code you provided, you might wish to add a print(gradient(X, y, p)) statement in the param_update function. With that added, we can monitor the gradient at each iteration and see that for a reasonably valued $\eta$ the gradient values slowly decrease, while for an unreasonably large $\eta$ they get steadily larger and larger.
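For instance, here is a sketch of a drop-in replacement for your inner param_update (printing the largest absolute gradient entry rather than the whole vector keeps the output easier to scan):

def param_update(p, lr, X, y):
    g = gradient(X, y, p)
    # Monitor the gradient: it shrinks towards zero for a reasonable learning
    # rate and grows without bound when the learning rate is too large.
    print(np.max(np.abs(g)))
    return(p - lr*g)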

Advanced variants of gradient descent use the concept of an adaptive learning rate; the optimisation algorithm Adadelta is a famous example of this. We might wish to play with a toy version of this notion by using a steadily decreasing step size. Assuming that we start with $\eta = \eta_0$, we can scale the step size $\eta_t$ used in the $t$-th iteration according to $\eta_t = \frac{\eta_0}{t}$. Notice that if we adaptively decrease $\eta$, we need to start with a reasonably large $\eta_0$ (say $1.0$ for your example).
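Here is a minimal, self-contained sketch of that decaying schedule on your toy data (it assumes the same $\frac{1}{n}\lVert y - Xp \rVert^2$ loss your code uses). The first few steps still overshoot, but the shrinking $\eta_t$ pulls the iterates back in instead of letting them explode:

import numpy as np

X = np.array([[1., 0., 0.],    # intercept column appended, as in your function
              [1., 1., 1.],
              [1., 2., 2.]])
y = np.array([[0.], [1.], [2.]])

def gradient(X, y, p):
    # Gradient of the mean squared error (1/n) * ||y - Xp||^2
    return (2 / len(y)) * X.T @ (X @ p - y)

eta0 = 1.0                     # deliberately larger than any stable fixed step size here
p = np.ones((3, 1))
for t in range(1, 201):
    eta_t = eta0 / t           # decaying step size eta_t = eta_0 / t
    p = p - eta_t * gradient(X, y, p)

print(p.ravel())               # settles near a least-squares solution instead of exploding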
