How intuition leads to a rigorous proof of the existence of Lagrange multipliers

intuition, lagrange multiplier, optimization

This question is about a particular strategy, which I think is very appealing from an intuitive viewpoint, for proving the existence of Lagrange multipliers. The proof is a great application of the "four subspaces theorem" from linear algebra, which states that $N(A)^\perp = R(A^T)$. I will describe the strategy, then state the question below.

First of all, here is the theorem.

Theorem 1 (Lagrange multiplier optimality condition).
Let $f:\mathbb R^n \to \mathbb R$ and $g:\mathbb R^n \to \mathbb R^c$ be continuously differentiable functions.
(Here $c < n$.)
Suppose that $x^* \in \mathbb R^n$ is a local minimizer for the optimization problem
\begin{align}
\tag{1} \mathop{\text{minimize}}_{x \in \mathbb R^n} & \quad f(x) \\
\text{subject to} & \quad g(x) = 0.
\end{align}

If the $c \times n$ derivative matrix $g'(x^*)$ has full rank (so its rank is $c$), then
there exists a vector $\lambda \in \mathbb R^c$ such that
\begin{equation*}
\nabla f(x^*) = g'(x^*)^T \lambda.
\end{equation*}
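
For a concrete instance of the theorem (a quick illustration of my own, not needed later): take $n = 2$, $c = 1$, $f(x) = x_1 + x_2$ and $g(x) = x_1^2 + x_2^2 - 1$. The constrained minimizer is $x^* = (-\tfrac{1}{\sqrt 2}, -\tfrac{1}{\sqrt 2})$, where $g'(x^*) = \begin{bmatrix} -\sqrt 2 & -\sqrt 2 \end{bmatrix}$, and indeed
$$
\nabla f(x^*) = \begin{bmatrix} 1 \\ 1 \end{bmatrix} = g'(x^*)^T \lambda \qquad \text{with } \lambda = -\tfrac{1}{\sqrt 2}.
$$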


The above theorem has a very clear proof in the case where $g$ is affine.
In this case, the optimization problem can be written as
\begin{align*}
\mathop{\text{minimize}}_{x \in \mathbb R^n} & \quad f(x) \\
\text{subject to} & \quad Ax = b
\end{align*}

where $A$ is a real $c \times n$ matrix and $b \in \mathbb R^c$.
We will make use of the "four subspaces theorem" from linear algebra, which tells us that $N(A)^\perp = R(A^T)$.
Suppose that $u \in N(A)$, so $Au = 0$. It must be true that the directional derivative $D_u f(x^*)$
satisfies
\begin{equation}
\tag{2} D_u f(x^*) = \langle \nabla f(x^*), u \rangle = 0,
\end{equation}

because otherwise $x^*$ would not be a local minimizer. Indeed, if (2)
were not satisfied, then we could decrease the value of $f$ without violating the constraint by moving
away from $x^*$ a short distance
in the direction $u$ or $-u$.
Equation (2) tells us that $\nabla f(x^*)$ is orthogonal to $u$.
Since $u$ was an arbitrary null vector of $A$, we conclude that $\nabla f(x^*) \in N(A)^\perp = R(A^T)$, which implies that
$$
\nabla f(x^*) = A^T \lambda
$$

for some vector $\lambda \in \mathbb R^c$. This completes the proof of theorem 1 in the special case where our optimization
problem has linear constraints (that is, in the case where $g$ is affine). In this special case,
we did not need to assume that $A$ has full rank.
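
For example (a small sanity check of the linear case, my own illustration): minimize $f(x) = x_1^2 + x_2^2$ subject to $x_1 + x_2 = 1$, so that $A = \begin{bmatrix} 1 & 1 \end{bmatrix}$ and $b = 1$. The minimizer is $x^* = (\tfrac12, \tfrac12)$, the null space of $A$ is spanned by $u = (1, -1)$, and indeed $\langle \nabla f(x^*), u \rangle = \langle (1,1), (1,-1) \rangle = 0$, with $\nabla f(x^*) = (1,1) = A^T \lambda$ for $\lambda = 1$.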


In the general case where $g$ is not affine, it is tempting to replace $g$
with its local linear approximation near $x^*$
$$
\tag{3} g(x) \approx g(x^*) + g'(x^*)(x - x^*)
$$

and invoke the above result for the case where the constraints are linear.
If this were valid, it would immediately yield the conclusion of theorem 1!

Trying to develop this idea into a rigorous proof, we might reason as follows: Let $u$ be a null vector of $g'(x^*)$, so that $g'(x^*) u = 0$.
If it were true that we could move a bit in the direction $u$ or $-u$
without violating the constraint that $g(x) = 0$, then
we could conclude (as above) that $D_u f(x^*) = \langle \nabla f(x^*), u \rangle = 0$
(otherwise we could reduce the value of $f$ by moving away from $x^*$ a bit in the direction $u$ or $-u$). The conclusion of theorem 1 would then follow by the same reasoning as above.

Unfortunately, when we move away from $x^*$ a short distance in the direction $u$,
the value of $g$ is not quite constant, because the local linear approximation (3) is only valid to first order. So the constraint $g(x) = 0$ is violated, ever so slightly. This is an obstacle to our proof strategy.
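
To see the obstacle concretely (an illustration of my own): if $g(x) = x_1^2 + x_2^2 - 1$ and $x^*$ lies on the unit circle, then $g'(x^*) u = 2 \langle x^*, u \rangle = 0$ says that $u$ is tangent to the circle at $x^*$, and
$$
g(x^* + t u) = \|x^* + t u\|^2 - 1 = 2 t \langle x^*, u \rangle + t^2 \|u\|^2 = t^2 \|u\|^2,
$$
which is nonzero for every $t \neq 0$. The constraint is violated at second order, no matter how small the step.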

Question: How can this proof strategy be developed into a rigorous proof of theorem 1?

I would like to reach the "post-rigorous" stage, where intuition and rigor are merged, of my understanding of theorem 1.

Here is a related question, but I'm posting this one to focus on how to convert the above intuition into a rigorous proof. I'll post an answer below.

Best Answer

We can salvage the suggested proof by moving along a nearly straight line (a curve) which starts out in the direction $u$ or $-u$, and along which the constraint $g(x) = 0$ remains satisfied. The directional derivative $D_u f(x^*)$ must again be zero, because otherwise we could reduce the value of $f$ by moving away from $x^*$ a short distance along this curve. It then follows, using the same reasoning as above, that $\nabla f(x^*) = g'(x^*)^T \lambda$ for some $\lambda \in \mathbb R^c$.

In more detail, we will use the implicit function theorem to show that if $g'(x^*) u = 0$, then there is a continuously differentiable parameterized curve $\gamma:(-\epsilon, \epsilon) \to \mathbb R^n$ such that:

\begin{align}
\tag{4} \gamma(0) &= x^*,\\
\tag{5} \gamma'(0) &= u,\\
\tag{6} g(\gamma(t)) &= 0 \text{ for all } t \in (-\epsilon, \epsilon).
\end{align}

The function $F(t) = f(\gamma(t))$ tells us the value of $f$ as we move along the curve $\gamma$. Note that $F'(t) = f'(\gamma(t)) \gamma'(t)$. Because $x^*$ is a local minimizer for problem (1), the function $F$ must have a local minimum at $t = 0$. Therefore,
\begin{align*}
0 &= F'(0) \\
&= f'(x^*) u \\
&= \langle \nabla f(x^*), u \rangle.
\end{align*}
This shows that $\nabla f(x^*)$ is orthogonal to $u$. Now we can easily finish the proof as in the case of linear equality constraints. So the only difficult part of this proof is showing that such a parameterized curve $\gamma$ exists.
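
In the circle example from before (continuing my illustration): if $g(x) = x_1^2 + x_2^2 - 1$, $x^*$ lies on the unit circle, and $u$ is a unit tangent vector at $x^*$, then $\gamma(t) = (\cos t)\, x^* + (\sin t)\, u$ satisfies (4) - (6) explicitly. The point of the proof below is that the implicit function theorem produces such a curve in general, when no explicit formula is available.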


With this strategy in place, we are ready to give a rigorous proof of theorem 1. In the proof below, we will make use of block notation. Let $d = n - c$. The vector $x^* \in \mathbb R^n$ can be written in block notation as
$$
x^* = \begin{bmatrix} x_1^* \\ x_2^* \end{bmatrix}
$$
where $x_1^* \in \mathbb R^d$ and $x_2^* \in \mathbb R^c$. If $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb R^n$ is the concatenation of the vectors $x_1 \in \mathbb R^d$ and $x_2 \in \mathbb R^c$, then we will sometimes write $g(x_1,x_2)$ rather than $g(x)$. The $c \times n$ derivative matrix $g'(x^*)$ can be written in block notation as
$$
g'(x^*) = \begin{bmatrix} \frac{\partial g(x_1^*, x_2^*)}{\partial x_1} & \frac{\partial g(x_1^*,x_2^*)}{\partial x_2} \end{bmatrix}
$$
where $\frac{\partial g(x_1^*, x_2^*)}{\partial x_1}$ is a $c \times d$ matrix and $\frac{\partial g(x_1^*, x_2^*)}{\partial x_2}$ is a $c \times c$ matrix.

Proof of theorem 1:

By the full rank assumption, we know that the $c \times n$ derivative matrix $g'(x^*)$ has a set of $c$ linearly independent columns. For simplicity, let's assume that the final $c$ columns of $g'(x^*)$ are linearly independent. The general theorem is an easy corollary of this special case: permuting the components of $x$ permutes the columns of $g'(x^*)$ correspondingly, so we can always reduce to this situation.

Let $u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$ be a null vector of $g'(x^*)$, so that
\begin{equation}
\tag{7} g'(x^*) u = 0 \qquad \text{or equivalently} \qquad \frac{\partial g(x_1^*,x_2^*)}{\partial x_1} u_1 + \frac{\partial g(x_1^*,x_2^*)}{\partial x_2} u_2 = 0.
\end{equation}
We will construct a parameterized curve $\gamma$ which satisfies the conditions (4) - (6) stated above. By the implicit function theorem (which applies because the $c \times c$ block $\frac{\partial g(x_1^*,x_2^*)}{\partial x_2}$ has linearly independent columns and is therefore invertible), there exist open sets $U_1 \ni x_1^*$ and $U_2 \ni x_2^*$ such that:

  • For each vector $x_1 \in U_1$, there is a unique vector $x_2 \in U_2$ such that $g(x_1,x_2) = 0$.
  • The function $h:U_1 \to U_2$ defined implicitly by $$ g(x_1, h(x_1)) = 0 \quad \text{for all } x_1 \in U_1 $$ is continuously differentiable.

In order to define $\gamma(t) = \begin{bmatrix} \gamma_1(t) \\ \gamma_2(t) \end{bmatrix}$ in such a way that $\gamma(0) = x^*$ and $\gamma'(0) = u$, we must necessarily define $\gamma_1(t)$ so that $\gamma_1(0) = x_1^*$ and $\gamma_1'(0) = u_1$. The easiest way to achieve this is to define $\gamma_1(t) = x_1^* + t u_1$. Because we also desire that $g(\gamma(t)) = 0$ for all $t$, we are then forced to define $\gamma_2(t) = h(\gamma_1(t))$. (Let's declare that the domain of $\gamma$ is an interval $(-\epsilon, \epsilon)$, with $\epsilon > 0$ chosen sufficiently small that $\gamma_1(t)$ belongs to the domain of $h$ whenever $-\epsilon < t < \epsilon$.)

Clearly the function $\gamma$ defined in this way satisfies $\gamma(0) = x^*$ and $g(\gamma(t)) = 0$ for all $t$. Let's check that this function $\gamma$ also satisfies $\gamma'(0) = u$. Notice that
$$
\gamma'(t) = \begin{bmatrix} \gamma_1'(t) \\ h'(\gamma_1(t)) \gamma_1'(t) \end{bmatrix} = \begin{bmatrix} u_1 \\ h'(\gamma_1(t)) u_1 \end{bmatrix}
$$
and so
$$
\gamma'(0) = \begin{bmatrix} u_1 \\ h'(x_1^*) u_1 \end{bmatrix}.
$$
Hopefully it will turn out that $h'(x_1^*) u_1 = u_2$. Let's see if this is the case. By differentiating both sides of the equation
$$
G(t) = g(\gamma_1(t), \gamma_2(t)) = 0
$$
we obtain
\begin{align*}
\frac{\partial g(\gamma_1(t),\gamma_2(t))}{\partial x_1} \gamma_1'(t) + \frac{\partial g(\gamma_1(t),\gamma_2(t))}{\partial x_2} \gamma_2'(t) = 0.
\end{align*}
Noting that $\gamma_2'(t) = h'(\gamma_1(t)) \gamma_1'(t)$ and setting $t = 0$, we find that
\begin{align*}
\frac{\partial g(x_1^*,x_2^*)}{\partial x_1} u_1 + \frac{\partial g(x_1^*, x_2^*)}{\partial x_2} h'(x_1^*) u_1 = 0.
\end{align*}
Comparing with equation (7) above, and using the fact that $\frac{\partial g(x_1^*, x_2^*)}{\partial x_2}$ is invertible, we find that $h'(x_1^*) u_1 = u_2$, as we had hoped.
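
In the circle example (my running illustration), with $g(x_1, x_2) = x_1^2 + x_2^2 - 1$ and $x_2^* > 0$, the implicitly defined function is simply $h(x_1) = \sqrt{1 - x_1^2}$, so $h'(x_1^*) u_1 = -\frac{x_1^*}{x_2^*} u_1 = u_2$ whenever $2 x_1^* u_1 + 2 x_2^* u_2 = 0$, in agreement with the general argument.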

The hard part of the proof is now complete. We have succeeded in constructing a continuously differentiable parameterized curve $\gamma$ which passes through the point $x^*$ in the direction $u$, and along which the constraint $g(x) = 0$ is not violated. In other words, the conditions (4) - (6) are satisfied. The rest of the proof is similar to the argument for the special case where our optimization problem has linear equality constraints.

To finish the proof, let's consider the function $F(t) = f(\gamma(t))$, which tells us the value of $f$ as we move along the curve $\gamma$. Note that $F'(t) = f'(\gamma(t)) \gamma'(t)$. Because $x^*$ is a local minimizer for problem (1), the function $F$ must have a local minimum at $t = 0$. Therefore,
\begin{align*}
0 &= F'(0) \\
&= f'(x^*) u \\
&= \langle \nabla f(x^*), u \rangle.
\end{align*}
So $\nabla f(x^*)$ is orthogonal to $u$. But $u$ is an arbitrary element of the null space of $g'(x^*)$. Thus, $\nabla f(x^*) \in N(g'(x^*))^\perp = R(g'(x^*)^T)$. This means that
$$
\nabla f(x^*) = g'(x^*)^T \lambda
$$
for some vector $\lambda \in \mathbb R^c$, which is what we wanted to show.
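
If it helps to see the optimality condition numerically, here is a minimal sanity check in Python (a sketch of my own, using the circle example from earlier and assuming NumPy and SciPy are available; the variable names are just illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative example: minimize f(x) = x1 + x2 subject to g(x) = x1^2 + x2^2 - 1 = 0.
f = lambda x: x[0] + x[1]
grad_f = lambda x: np.array([1.0, 1.0])
g = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0])
Dg = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]]])  # the 1 x 2 derivative matrix g'(x)

# Solve the constrained problem numerically (SLSQP handles equality constraints).
res = minimize(f, x0=np.array([0.5, -0.5]),
               constraints=[{'type': 'eq', 'fun': g, 'jac': Dg}])
x_star = res.x  # should be close to (-1/sqrt(2), -1/sqrt(2))

# Recover lambda by least squares: solve g'(x*)^T lam = grad f(x*).
lam, *_ = np.linalg.lstsq(Dg(x_star).T, grad_f(x_star), rcond=None)

print("x* =", x_star, " lambda =", lam)  # lambda should be close to -1/sqrt(2)
print("stationarity holds:", np.allclose(Dg(x_star).T @ lam, grad_f(x_star), atol=1e-5))
```

The recovered multiplier matches the $\lambda = -\tfrac{1}{\sqrt 2}$ computed by hand in the small example near the top of the post.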
