[Math] Setting the gradient to 0 gives a minimum

gradient descent · linear algebra · optimization · vector-spaces

I'm going through the math of machine learning, gradient descent, linear regression, etc., and I think I am just getting choked up on words at this point. I understand that in gradient descent, the gradient points in the direction of steepest ascent. The one thing I am getting caught up on is that when we are trying to minimize a cost function $J$ in least squares regression, we find the minimum by setting $\nabla J = 0$.

Couldn't setting the derivative to 0 potentially give us a maximum? How are we guaranteed that setting the cost function's derivative to 0 gives us a minimum? Is there an intuitive explanation? I have not had much luck wrapping my head around this so far.

Best Answer

Consider minimizing $f\colon\mathbb{R}^n\to\mathbb{R}$ over $\mathbb{R}^n$, i.e. without constraints. Assuming $f$ is twice differentiable and denoting by $\nabla$ and $\nabla^2$ the gradient and Hessian operators, there are three optimality conditions we generally care about:

  • First order necessary condition: If $x^*\in\mathbb{R}^n$ is a local minimum of $f$, then $\nabla f(x^*) = 0$.
  • Second order necessary condition: If $x^*\in\mathbb{R}^n$ is a local minimum of $f$, then $\nabla^2 f(x^*) \succeq 0$.
  • Second order sufficient condition: Let $x^*\in\mathbb{R}^n$. If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$, then $x^*$ is a local minimum of $f$.

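As a concrete sketch of how these three conditions get checked in practice, here is a small Python/NumPy example (the test function $f(x,y)=x^2+3y^2$, the helper name `classify_stationary_point`, and the tolerance are my own illustrative choices): given the gradient and Hessian at a candidate point, it tests whether the gradient vanishes and then inspects the eigenvalues of the Hessian.

```python
import numpy as np

def classify_stationary_point(grad, hess, x, tol=1e-8):
    """Classify a candidate point x using the three conditions above.

    grad(x) returns the gradient of f at x, hess(x) the (symmetric) Hessian.
    """
    g = grad(x)
    if np.linalg.norm(g) > tol:
        return "gradient is nonzero: first order necessary condition fails"
    eigvals = np.linalg.eigvalsh(hess(x))
    if np.all(eigvals > tol):
        return "local minimum: second order sufficient condition holds"
    if np.all(eigvals >= -tol):
        return "inconclusive: necessary conditions hold, sufficient condition fails"
    return "not a local minimum: second order necessary condition fails"

# Illustrative test: f(x, y) = x^2 + 3*y^2 has a strict global minimum at the origin.
grad = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
print(classify_stationary_point(grad, hess, np.zeros(2)))
```
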
To gain some intuition on the first order necessary condition, let's look at the Taylor series expansion of $f$ about a point $x^*\in\mathbb{R}^n$:

\begin{equation*} f(x) \approx f(x^*) + \nabla f(x^*)^\top(x-x^*) + \frac{1}{2}(x-x^*)^\top\nabla^2 f(x^*)(x-x^*). \end{equation*}

Suppose $\nabla f(x^*) \ne 0$. Then for $x$ close to $x^*$, the first order term dominates, and $f(x) - f(x^*) \approx \nabla f(x^*)^\top (x-x^*)$. Taking $x$ to be a point near $x^*$ in the negative gradient direction, we set $x = x^* - \epsilon\nabla f(x^*)$ for some small $\epsilon>0$. We then find that $f(x)-f(x^*) \approx -\epsilon \|\nabla f(x^*)\|_2^2 < 0$, showing that this nearby point $x$ actually has lower objective value than $x^*$. We conclude that $x^*$ is not a local minimum when $\nabla f(x^*) \ne 0$, which is equivalent to the first order necessary condition.
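
To see this argument numerically, here is a quick sketch (again NumPy; the quadratic test function, the point $x^*$, and the step size $\epsilon$ are arbitrary choices of mine) showing that at a point where $\nabla f \ne 0$, a small step in the direction $-\nabla f$ lowers the objective by roughly $\epsilon\|\nabla f\|_2^2$:

```python
import numpy as np

# A simple test function f(x) = ||x - c||^2 with gradient 2*(x - c).
c = np.array([1.0, -2.0])
f = lambda x: np.sum((x - c) ** 2)
grad = lambda x: 2 * (x - c)

x_star = np.array([3.0, 0.5])          # a point where the gradient is NOT zero
eps = 1e-3
x_new = x_star - eps * grad(x_star)    # step in the negative gradient direction

# f decreases by roughly eps * ||grad||^2, matching the first order Taylor estimate.
print(f(x_new) - f(x_star), -eps * np.linalg.norm(grad(x_star)) ** 2)
```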

Now let's look at the second order necessary condition. Let $\nabla f(x^*) = 0$. Then the Taylor series expansion gives $f(x) - f(x^*) \approx \frac{1}{2}(x-x^*)^\top \nabla^2 f(x^*) (x-x^*)$. Suppose $\nabla^2 f(x^*) \nsucceq 0$. Then there exists a vector $v\in\mathbb{R}^n$ such that $v^\top \nabla^2 f(x^*) v < 0$. Therefore, choose $x$ to be a point near $x^*$ in the direction of $v$, i.e. $x = x^* + \epsilon v$ for some $\epsilon>0$. Then the Taylor series expansion yields $f(x)-f(x^*) \approx \frac{\epsilon^2}{2} v^\top \nabla^2 f(x^*) v < 0$, showing that this nearby point $x$ has lower objective value than $x^*$. We again conclude that $x^*$ cannot be a local minimum, which is equivalent to the second order necessary condition.
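
The same reasoning can be checked numerically on a saddle point. Below is a sketch (my own toy example $f(x,y)=x^2-y^2$, which has a stationary point at the origin) that extracts a direction of negative curvature from the Hessian's eigendecomposition and verifies that stepping along it decreases $f$:

```python
import numpy as np

f = lambda x: x[0] ** 2 - x[1] ** 2              # saddle point at the origin
hess = np.array([[2.0, 0.0], [0.0, -2.0]])       # Hessian (constant for this f)

x_star = np.zeros(2)                             # the gradient vanishes here
eigvals, eigvecs = np.linalg.eigh(hess)
v = eigvecs[:, np.argmin(eigvals)]               # direction with v^T H v < 0

eps = 1e-2
print(f(x_star + eps * v) - f(x_star))           # negative: x_star is not a local minimum
```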

Finally, let's gain some intuition on the second order sufficient condition. Let $\nabla f(x^*)=0$ and $\nabla^2 f(x^*) \succ 0$. Then $v^\top\nabla^2 f(x^*) v >0$ for all $v\in\mathbb{R}^n\setminus\{0\}$. Let $x$ be a point near $x^*$, i.e. $x = x^* + \epsilon v$ for some arbitrary direction $v\ne 0$ and some $\epsilon>0$. Then for $\epsilon$ sufficiently small, the Taylor series together with the positive definiteness of $\nabla^2f(x^*)$ gives $f(x) - f(x^*) \approx \frac{\epsilon^2}{2} v^\top\nabla^2 f(x^*) v > 0$. That is, no matter which direction you use to choose $x$, as long as it is sufficiently close to $x^*$ (and not equal to $x^*$), we have $f(x)>f(x^*)$, so we conclude that $x^*$ is a local minimum.
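
Numerically, the sufficient condition can be checked by verifying positive definiteness of the Hessian (for example via its eigenvalues or a Cholesky factorization) and then confirming that $f$ increases in random directions around $x^*$. A small sketch with my own toy function $f(x,y)=x^2+y^2$:

```python
import numpy as np

f = lambda x: x[0] ** 2 + x[1] ** 2
hess = np.array([[2.0, 0.0], [0.0, 2.0]])

x_star = np.zeros(2)                    # the gradient vanishes here
np.linalg.cholesky(hess)                # succeeds only if the Hessian is positive definite

rng = np.random.default_rng(0)
eps = 1e-3
for _ in range(5):
    v = rng.standard_normal(2)
    print(f(x_star + eps * v) - f(x_star) > 0)   # True for every sampled direction
```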

From these conditions, we see that for a general function $f$, setting $\nabla f(x^*)=0$ and solving for $x^*$ does not guarantee that $x^*$ is a local minimum. Consider for instance the function $f\colon\mathbb{R}\to\mathbb{R}$ defined by \begin{equation*} f(x) = x^3. \end{equation*} Setting $\nabla f(x^*) = 3x^{*2} = 0$, we conclude that $x^* = 0$. Furthermore, $\nabla^2 f(x^*) = 6x^* = 0$ at this point, so $\nabla^2 f(x^*) \succeq 0$. Hence, $x^*=0$ satisfies both the first and second order necessary conditions. However, $x^*=0$ is actually a saddle point of this function (to see this, simply plot $f$)! This example shows how even when a point satisfies necessary conditions for optimality, it may not be a local minimum.

On the other hand, consider the function $f\colon\mathbb{R}\to\mathbb{R}$ defined by \begin{equation*} f(x) = x^4. \end{equation*} In this case, $x^*=0$ is a minimum (this is easy to see since $f(x)\ge 0$ for all $x\in\mathbb{R}$ and $f(0)=0$). We have that $\nabla f(x^*) = 4x^{*3} = 0$ and $\nabla^2 f(x^*) = 12x^{*2} = 0$. Hence, $\nabla^2 f(x^*) \succeq 0$ but $\nabla^2 f(x^*) \nsucc 0$. Therefore, the minimum $x^*=0$ satisfies the two necessary conditions (as it must), but it does not satisfy the second order sufficient condition (and therefore to prove $x^*=0$ is a minimum, you must resort to some other technique such as the nonnegativity of $f$ on $\mathbb{R}$ as we mentioned previously).
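
These two one dimensional examples are also easy to probe numerically. The sketch below (plain Python; the probe width $\epsilon$ is my own choice) evaluates $f(x)=x^3$ and $f(x)=x^4$ slightly to either side of $x^*=0$: the cubic decreases on one side, so $0$ is not a minimum, while the quartic increases on both sides.

```python
eps = 1e-2
for f, name in [(lambda x: x ** 3, "x^3"), (lambda x: x ** 4, "x^4")]:
    left, right = f(-eps) - f(0.0), f(eps) - f(0.0)
    print(f"{name}: f(-eps)-f(0) = {left:+.2e}, f(+eps)-f(0) = {right:+.2e}")
# x^3: negative on the left, positive on the right -> saddle/inflection, not a minimum
# x^4: positive on both sides                      -> local (in fact global) minimum
```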

Intuitively, what is going on in these examples is the following: the first order necessary condition tells us that $f$ is locally flat at $x^*$. This happens at minima, maxima, and saddle points. The second order conditions give us information about the curvature of $f$ at the point $x^*$. Intuitively, if the function curves upward in every direction at that point, then we expect the point to be a local minimum. However, when $\nabla^2 f(x^*)$ is positive semidefinite but not positive definite, there exist directions (those along eigenvectors with zero eigenvalue) in which the function is flat to second order, so the second derivative information is not enough to tell us whether the function curves up or down away from this local approximation (this is the case for $f(x)=x^3$ and $f(x)=x^4$ at the point $x^*=0$). Finally, when $\nabla^2 f(x^*)\succ 0$, the second derivative information guarantees that in every direction around $x^*$ the function $f$ is locally curving upward, and therefore $x^*$ is a local minimum.

An extremely important remark is that when $f$ is a convex function, the condition that $\nabla f(x^*) = 0$ becomes necessary and sufficient for $x^*$ to be a global minimum of $f$. This is why in least squares optimization problems, setting $\nabla f(x^*)=0$ and solving for $x^*$ is guaranteed to give us a global minimum. Indeed, this global optimality guarantee is precisely why convex optimization is so rich and powerful.
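
For least squares specifically, $f(\beta) = \|X\beta - y\|_2^2$ is convex, and $\nabla f(\beta) = 2X^\top(X\beta - y) = 0$ gives the normal equations $X^\top X \beta = X^\top y$. Here is a sketch (synthetic random data of my own, assuming $X^\top X$ is invertible) showing that solving the normal equations matches a standard least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                      # synthetic design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Setting the gradient of ||X b - y||^2 to zero gives the normal equations X^T X b = X^T y.
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal_eq, beta_lstsq))         # True: both find the global minimum
```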

In the case of constrained optimization, i.e. $\inf_{x\in\mathcal{X}}f(x)$ for some $\mathcal{X}\subset\mathbb{R}^n$, the necessary and sufficient optimality conditions become more complicated, since optimal solutions may now exist on the boundary of the feasible set $\mathcal{X}$, where the gradient might not be zero. For more information, look up the Fritz John and KKT conditions, or check out the book Nonlinear Programming by Bertsekas.