[Math] Decrease in the size of gradient in gradient descent

Tags: convex-optimization, convex-analysis, gradient-descent, numerical-optimization, optimization

Gradient descent reduces the value of the objective function at each iteration (for a suitable step size), and this is repeated until convergence.

The question is: does the norm of the gradient also have to decrease in every iteration of gradient descent?


Edit: How about when the objective is a convex function?

Best Answer

Suppose we have the following objective function

$$f (x) := \frac{1}{2} x^T Q x - r^T x$$

where $Q \in \mathbb R^{n \times n}$ is symmetric and positive definite and $r \in \mathbb R^n$. From the symmetry of $Q$, we conclude that its eigenvalues are real. From the positive definiteness of $Q$, we conclude that its eigenvalues are positive and that $f$ is a convex function.

Taking the gradient of $f$,

$$\nabla f (x) = Q x - r$$

The gradient vanishes at $\bar{x} := Q^{-1} r$, which is the unique solution to the linear system $Q x = r$.
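As a quick numerical illustration (a minimal NumPy sketch; the particular $Q$ and $r$ below are arbitrary choices, not taken from the question), one can pick a small symmetric positive definite $Q$ and check that the gradient indeed vanishes at $\bar{x} = Q^{-1} r$:

```python
import numpy as np

# Hypothetical example: Q symmetric positive definite, r arbitrary.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
r = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ Q @ x - r @ x   # objective f(x) = (1/2) x^T Q x - r^T x
grad_f = lambda x: Q @ x - r            # its gradient

x_bar = np.linalg.solve(Q, r)           # minimizer: unique solution of Q x = r
print(np.linalg.norm(grad_f(x_bar)))    # ~0 up to floating-point error
```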

Doing gradient descent,

$$\begin{array}{rl} x_{k+1} &= x_k - \alpha \nabla f (x_k)\\\\ &= x_k - \alpha Q x_k + \alpha r\\\\ &= (I_n - \alpha Q) x_k + \alpha r\end{array}$$

Since $\bar{x} = (I_n - \alpha Q) \bar{x} + \alpha r$ (because $Q \bar{x} = r$), subtracting gives $x_{k+1} - \bar{x} = (I_n - \alpha Q)(x_k - \bar{x})$, and hence

$$x_k = \bar{x} + (I_n - \alpha Q)^k (x_0 - \bar{x})$$
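Reusing the same hypothetical $Q$ and $r$, this closed-form expression can be checked against the iteration directly (the step size $\alpha = 0.3$ and the starting point are again arbitrary, chosen inside the convergence range derived below):

```python
import numpy as np
from numpy.linalg import matrix_power

Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # same hypothetical Q and r as above
r = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x - r
x_bar = np.linalg.solve(Q, r)

alpha, k = 0.3, 15                      # arbitrary step size and iteration count
x0 = np.array([5.0, 4.0])

# Run k iterations of gradient descent.
x = x0.copy()
for _ in range(k):
    x = x - alpha * grad_f(x)

# Closed form: x_k = x_bar + (I - alpha Q)^k (x_0 - x_bar)
x_closed = x_bar + matrix_power(np.eye(2) - alpha * Q, k) @ (x0 - x_bar)
print(np.allclose(x, x_closed))         # True
```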

Convergence to $\bar{x}$ happens when

$$|1 - \alpha \lambda_i (Q)| < 1$$

for all $i \in \{1,2,\dots,n\}$, i.e., when

$$0 < \alpha < \frac{2}{\lambda_{\max} (Q)} =: \alpha_{\max}$$
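For the hypothetical $Q$ used in the sketches above, this bound is easy to compute explicitly:

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # same hypothetical Q as above
lam = np.linalg.eigvalsh(Q)             # eigenvalues of the symmetric matrix Q
alpha_max = 2.0 / lam.max()
print(lam, alpha_max)                   # eigenvalues ~ [1.38, 3.62], alpha_max ~ 0.55
```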

Evaluating the gradient,

$$\begin{array}{rl} \nabla f (x_{k+1}) &= Q x_{k+1} - r\\\\ &= Q (I_n - \alpha Q) x_k + \alpha Q r - r\\\\ &= (I_n - \alpha Q) (Q x_k - r)\\\\ &= (I_n - \alpha Q) \nabla f (x_k)\end{array}$$
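This one-step identity can also be verified numerically for the same hypothetical $Q$ and $r$ and an arbitrary iterate:

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # same hypothetical Q and r as above
r = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x - r

alpha = 0.3                             # the identity holds for any step size
x_k = np.array([2.0, -3.0])             # arbitrary iterate
x_next = x_k - alpha * grad_f(x_k)      # one gradient descent step

lhs = grad_f(x_next)
rhs = (np.eye(2) - alpha * Q) @ grad_f(x_k)
print(np.allclose(lhs, rhs))            # True
```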

Thus,

$$\|\nabla f (x_{k+1})\|_2 \leq \|I_n - \alpha Q\|_2 \|\nabla f (x_k)\|_2$$

If $0 < \alpha < \alpha_{\max}$, then $|1 - \alpha \lambda_i (Q)| < 1$ for all $i$, and since $I_n - \alpha Q$ is symmetric, $\|I_n - \alpha Q\|_2 = \max_i |1 - \alpha \lambda_i (Q)| < 1$. Thus, whenever the step size guarantees convergence, the $2$-norm of the gradient contracts strictly at each iteration (provided $\nabla f (x_k) \neq 0$):

$$\|\nabla f (x_{k+1})\|_2 < \|\nabla f (x_k)\|_2$$
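Putting it together, a short run of gradient descent on the same hypothetical problem (with $\alpha = 0.5 < \alpha_{\max} \approx 0.55$) shows the gradient norm decreasing strictly at every iteration:

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # same hypothetical Q and r as above
r = np.array([1.0, -1.0])
grad_f = lambda x: Q @ x - r

alpha = 0.5                             # inside (0, alpha_max), alpha_max ~ 0.55 here
x = np.array([10.0, -10.0])

norms = []
for _ in range(20):
    norms.append(np.linalg.norm(grad_f(x)))
    x = x - alpha * grad_f(x)

# The gradient norm decreases strictly at every iteration.
print(all(b < a for a, b in zip(norms, norms[1:])))  # True
```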
