How to prove the Lagrange multiplier theorem in a rigorous but intuitive way

differential-geometry, implicit-function-theorem, lagrange-multiplier, optimization

Following some textbooks, the Lagrange multiplier theorem can be stated as follows.

Let $U \subset \mathbb{R}^n$ be an open set and let $f:U\rightarrow \mathbb{R}, g:U\rightarrow \mathbb{R}$ be $C^1$ functions. Let $\mathbf{x}_{0} \in U, c=g(\mathbf{x}_0)$, and let $S$ be the level set of $g$ with value $c$. Suppose that $\nabla g(\mathbf{x}_0) \neq 0$.
If the function $f|_S$ has a local extremum at $\mathbf{x}_0$, then there is a $\lambda \in \mathbb{R}$ such that $\nabla f(\mathbf{x}_0)=\lambda \nabla g(\mathbf{x}_0)$.

Some textbooks give a rigorous proof using the implicit function theorem (see proof). The proof is clear; however, I can't build any intuition from it. Can someone help me prove this theorem in a rigorous but more intuitive way? Thanks.

Best Answer

Here's some intuition.

First suppose that $x_0$ is a local minimizer for the problem \begin{align} \tag{$\spadesuit$} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad Ax = b. \end{align} If $Au = 0$, then the directional derivative $D_u f(x_0) = \langle \nabla f(x_0),u \rangle$ is nonnegative. Otherwise, it would be possible to improve $x_0$ by perturbing it a bit in the direction $u$.

It follows that if $Au = 0$, then $\langle \nabla f(x_0),u \rangle = 0$: since $A(-u) = 0$ as well, the directional derivative in the direction $-u$ is also nonnegative, so both $\langle \nabla f(x_0),u \rangle \geq 0$ and $\langle \nabla f(x_0),u \rangle \leq 0$.

So, $\nabla f(x_0)$ is in the orthogonal complement of the null space of $A$.

BUT, according to the "four subspaces theorem" (which is beloved by Gilbert Strang), the orthogonal complement of the null space of $A$ is the range of $A^T$. Thus, $\nabla f(x_0)$ is in the range of $A^T$. In other words, there exists a vector $\lambda$ such that \begin{equation} \nabla f(x_0) = A^T \lambda. \end{equation} This is our Lagrange multiplier optimality condition, in the case where we have linear equality constraints.
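
If it helps to see this condition numerically, here is a minimal sketch (assuming NumPy; the quadratic objective and the data $A$, $b$, $c$ are made up for illustration). For $f(x) = \|x-c\|^2$, the minimizer of ($\spadesuit$) is the projection of $c$ onto $\{x : Ax = b\}$, and least squares recovers a $\lambda$ with $\nabla f(x_0) = A^T \lambda$:

```python
import numpy as np

# Toy instance of problem (spades): minimize f(x) = ||x - c||^2 subject to Ax = b.
# The data A, b, c are made up; the minimizer is the projection of c onto {Ax = b}.
rng = np.random.default_rng(0)
m, n = 2, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)

# Projection of c onto the affine set:  x0 = c - A^T (A A^T)^{-1} (A c - b).
x0 = c - A.T @ np.linalg.solve(A @ A.T, A @ c - b)
grad_f = 2 * (x0 - c)                      # gradient of f at x0

# Claim: grad_f lies in the range of A^T, i.e. grad_f = A^T lam for some lam.
lam, *_ = np.linalg.lstsq(A.T, grad_f, rcond=None)
print("constraint violation:", np.linalg.norm(A @ x0 - b))            # ~ 0
print("||grad_f - A^T lam||:", np.linalg.norm(grad_f - A.T @ lam))    # ~ 0
```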

Next, suppose that $x_0$ is a local minimizer for the problem \begin{align} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad g(x) = 0, \end{align} where $g:\mathbb R^n \to \mathbb R^m$.

It seems plausible that $x_0$ is also a local minimizer for the problem we get by replacing $g$ with the local linear approximation to $g$ at $x_0$: \begin{align} \text{minimize} & \quad f(x) \\ \text{subject to} &\quad g(x_0) + g'(x_0)(x - x_0) = 0. \end{align} (Here $g'(x_0)$ is the Jacobian matrix of $g$ at $x_0$.) Assuming that $x_0$ is indeed a local minimizer for this modified problem, it follows that $x_0$ satisfies the optimality condition derived above: \begin{align} \nabla f(x_0) &= g'(x_0)^T \lambda \\ &= \nabla g_1(x_0) \lambda_1 + \cdots + \nabla g_m(x_0) \lambda_m \end{align} for some vector $\lambda = \begin{bmatrix} \lambda_1 & \ldots & \lambda_m \end{bmatrix}^T \in \mathbb R^m$. (Here $g_1,\ldots,g_m$ are the component functions of $g$, and we use the convention that the gradient is a column vector.) This is our Lagrange multiplier optimality condition in the case of nonlinear equality constraints.
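
As a sanity check of this condition in the nonlinear case, here is a small sketch (assuming NumPy; the instance is made up): minimize $f(x) = x_1 + x_2$ on the unit circle $g(x) = x_1^2 + x_2^2 - 1 = 0$, whose minimizer is $x_0 = (-1/\sqrt 2, -1/\sqrt 2)$, and read off the single multiplier:

```python
import numpy as np

# Toy instance with m = 1: minimize f(x) = x_1 + x_2 subject to x_1^2 + x_2^2 = 1.
x0 = np.array([-1.0, -1.0]) / np.sqrt(2.0)   # known minimizer on the unit circle

grad_f = np.array([1.0, 1.0])                # gradient of f (constant)
grad_g = 2.0 * x0                            # gradient of g at x0

# Solve grad_f = lam * grad_g for the single multiplier lam.
lam = (grad_f @ grad_g) / (grad_g @ grad_g)
print("lambda =", lam)                                                        # -1/sqrt(2)
print("||grad_f - lam * grad_g|| =", np.linalg.norm(grad_f - lam * grad_g))   # ~ 0
```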

I believe it's possible to view the proof using the implicit function theorem as a rigorous version of this intuition.


Edit: Now I'll show how a similar approach can handle inequality constraints, if we replace the "four subspaces theorem" with Farkas's lemma (which I call the four cones theorem).

Let $x^*$ be a local minimizer for the problem \begin{align} \tag{$\heartsuit$} \text{minimize} & \quad f(x) \\ \text{subject to} & \quad Ax \leq b. \end{align} Suppose that $u \in \mathbb R^n$ and $Au \leq 0$. Then the directional derivative $D_uf(x^*) = \langle \nabla f(x^*),u \rangle \geq 0$. Otherwise, we could improve $x^*$ by perturbing it a bit in the direction of $u$.

At this point, we would like to invoke something like the four subspaces theorem, to find some optimality condition that $\nabla f(x^*)$ must satisfy. While the four subspaces theorem is inapplicable here, there is a one-sided version of the four subspaces theorem that fits this situation perfectly.

The four cones theorem

We first recall a few facts from convex analysis. If $C \subset \mathbb R^n$ is a cone, then the polar of $C$ is defined by \begin{equation} C^\circ = \{z \mid \langle x,z \rangle \leq 0 \,\, \forall \,\, x \in C \}. \end{equation} Note that $C^\circ$ is a closed convex cone.
A convex cone is a "one-sided" version of a subspace, and the polar cone is a "one-sided" version of the orthogonal complement. Instead of $\langle x,z \rangle = 0$, we have $\langle x, z \rangle \leq 0$. Here's some support for this analogy:

  • For a subspace $W$ of $\mathbb R^n$, we have the familiar equation $W = W^{\perp \perp}$. The one-sided version of this is that if $C$ is a closed convex cone, then $C = C^{\circ \circ}$.
  • If $W$ is a subspace of $\mathbb R^n$, then any $x \in \mathbb R^n$ can be decomposed as $x = \Pi_W(x) + \Pi_{W^\perp}(x)$. The one-sided version of this is that if $C \subset \mathbb R^n$ is a closed convex cone, then any $x \in \mathbb R^n$ can be decomposed as $x = \Pi_C(x) + \Pi_{C^\circ}(x)$. Moreover, $\Pi_C(x)$ is orthogonal to $\Pi_{C^\circ}(x)$.
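
For the simplest closed convex cone, the nonnegative orthant $C = \{x \mid x \geq 0\}$ (whose polar is the nonpositive orthant), the decomposition in the second bullet is just the split of a vector into its positive and negative parts. A minimal NumPy sketch:

```python
import numpy as np

# Moreau decomposition for the nonnegative orthant C = {x : x >= 0}, with polar
# C° = {x : x <= 0}.  Projection onto C clips negative entries; projection onto
# C° clips positive entries.
rng = np.random.default_rng(1)
x = rng.standard_normal(6)

p_cone  = np.maximum(x, 0.0)   # Pi_C(x)
p_polar = np.minimum(x, 0.0)   # Pi_{C°}(x)

print("x == Pi_C(x) + Pi_{C°}(x):", np.allclose(x, p_cone + p_polar))  # True
print("<Pi_C(x), Pi_{C°}(x)> =", p_cone @ p_polar)                     # 0.0
```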

Let $\mathcal R_+(A^T) = \{ A^T z \mid z \geq 0 \}$. So $\mathcal R_+(A^T)$ is a "one-sided version" of the column space of $A^T$. I call it the column cone of $A^T$. Let $\mathcal N_-(A) = \{x \mid Ax \leq 0 \}$. So $\mathcal N_-(A)$ is a "one-sided version" of the null space of $A$. I call it the null cone of $A$. Note that $\mathcal N_-(A)$ and $\mathcal R_+(A^T)$ are both closed convex cones.

The four subspaces theorem can be stated as \begin{equation} \mathcal N(A)^\perp = \mathcal R(A^T). \end{equation} Here is our "one-sided" version of the four subspaces theorem: \begin{equation} \mathcal N_-(A)^\circ = \mathcal R_+(A^T). \end{equation} I call this the "four cones theorem".

Invoking the four cones theorem

We can use the "four cones theorem" to find an optimality condition for problem ($\heartsuit$), just as we used the four subspaces theorem to find an optimality condition for problem ($\spadesuit$). As noted previously, if $A u \leq 0$, then $\langle \nabla f(x^*), u \rangle \geq 0$. In other words, $-\nabla f(x^*) \in \mathcal N_-(A)^\circ$. By the four cones theorem, it follows that $-\nabla f(x^*) \in \mathcal R_+(A^T)$. Thus, there exists a vector $z \geq 0$ such that \begin{equation} \nabla f(x^*) + A^T z = 0. \end{equation} This, together with $Ax^* \leq b$, is our optimality condition (KKT condition) for the problem ($\heartsuit$) with linear inequality constraints.
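
Here is a minimal numeric sketch of this condition (assuming NumPy; the instance, with $A = I$ and $f(x) = \|x - c\|^2$, is made up for illustration, and its minimizer is the componentwise clip $x^* = \min(c, b)$):

```python
import numpy as np

# Toy instance of problem (hearts) with A = I: minimize ||x - c||^2 subject to x <= b.
# The minimizer is x* = min(c, b) componentwise.
rng = np.random.default_rng(2)
n = 6
c = rng.standard_normal(n)
b = rng.standard_normal(n)

x_star = np.minimum(c, b)          # minimizer
grad_f = 2.0 * (x_star - c)        # gradient of f at x*

# KKT condition: grad_f + A^T z = 0 for some z >= 0 (here A = I, so z = -grad_f).
z = -grad_f
print("z >= 0:", bool(np.all(z >= -1e-12)))                      # True
print("feasible (x* <= b):", bool(np.all(x_star <= b + 1e-12)))  # True
print("||grad_f + z||:", np.linalg.norm(grad_f + z))             # 0.0
```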

Connection with Farkas's lemma

The four cones theorem can be stated as a "theorem of the alternative". If $g \in \mathbb R^n$, then $g$ either does or does not belong to $\mathcal R_+(A^T)$; by the four cones theorem, this says that either $g$ belongs to $\mathcal R_+(A^T)$, or else $g$ does not belong to the polar of $\mathcal N_-(A)$. This means that one and only one of the following two alternatives is true:

  1. There exists $z \geq 0$ such that $g = A^T z$.
  2. There exists $u$ such that $Au \leq 0$ and $\langle g, u \rangle > 0$.

This is equivalent to the statement of Farkas's lemma given in Boyd and Vandenberghe. The four cones theorem is Farkas's lemma.
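
A small computational sketch of the two alternatives (assuming NumPy and SciPy; the data are made up): projecting $g$ onto $\mathcal R_+(A^T)$ with nonnegative least squares either writes $g = A^T z$ with $z \geq 0$ (alternative 1), or, via the Moreau decomposition, leaves a residual $u = g - A^T z \in \mathcal N_-(A)$ with $\langle g, u \rangle = \|u\|^2 > 0$ (alternative 2):

```python
import numpy as np
from scipy.optimize import nnls

# Farkas's lemma as a computation: project g onto the column cone R_+(A^T).
rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, n))
g = rng.standard_normal(n)

z, residual = nnls(A.T, g)          # min ||A^T z - g||_2 subject to z >= 0
if residual < 1e-9:
    print("Alternative 1: g = A^T z with z =", z)
else:
    # g - A^T z is the projection of g onto the null cone N_-(A), so it
    # satisfies Au <= 0 and <g, u> = ||u||^2 > 0: a certificate for alternative 2.
    u = g - A.T @ z
    print("Alternative 2: Au <= 0 ?", bool(np.all(A @ u <= 1e-9)), "  <g,u> =", g @ u)
```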

Visualizing the four cones theorem

The "alternative" statement of the four cones theorem suggests a way of visualizing the four cones theorem that makes it seem obvious. The first alternative is that $g$ belongs to the convex cone generated by the columns of $A^T$. The second alternative is that there is a hyperplane through the origin that separates $g$ from the columns of $A^T$. (The vector $u$ is a normal vector to this separating hyperplane.)

Problem with linear equality and inequality constraints

This case is similar to the previous cases, except that we will use a different version of Farkas's lemma that involves a combination of equality and inequality constraints.
