[Math] Help me organize these concepts — KKT conditions and dual problem

convex optimizationduality-theoremskarush kuhn tucker

This is a long question in which I explain my current understanding of certain ideas. If anyone is interested in reading this and would like to provide any commentary/feedback that may help me understand these ideas more clearly, or that you think I might find interesting, I'd appreciate it!

In multivariable calculus, to minimize a function $f:\mathbb R^N \to \mathbb R$ we write down the optimality condition $\nabla f(x) = 0$ and solve for $x$. Multivariable calculus classes usually take this further and give a method for minimizing $f$ subject to an equality constraint $h(x) = 0$. You form the Lagrangian $L(x,z) = f(x) + \langle h(x), z \rangle$, write down the optimality condition

\begin{align}
h(x) &= 0 \\
\frac{\partial L(x,z)}{\partial x} &= 0
\end{align}
and solve for $x$ (and $z$).

Calculus classes normally stop there, but they could take it further still and show how to handle inequality constraints — in that case we'll have a nonnegativity constraint and "complementary slackness" in the optimality conditions (KKT conditions).

Under certain assumptions ("constraint qualifications"), these optimality conditions are necessary (but not sufficient) for $x$ to be a local minimizer.

These optimality conditions can be derived by linearizing about a minimizer, or by drawing some pictures and making some geometric arguments. (Nocedal and Wright presents this material well.)

So far, we have not mentioned convexity at all, nor have we mentioned the dual problem. So how do those ideas fit into the picture?

A calculus class could mention that if our optimization problem is convex, then the necessary conditions in fact are sufficient for $x$ to be a global minimizer.

But there is more to be said about the convex case. When working with convex functions, it's unnatural to require them to be differentiable. We should allow for non-differentiable convex functions and use subgradients rather than gradients. Indeed a version of the KKT conditions for convex optimization problems can be given which does not assume differentiability.

And how do we prove that for a convex optimization problem, if an appropriate "constraint qualification" is satisfied, then the subgradient version of the KKT conditions is necessary and sufficient for $x$ to be a minimizer?

There are different ways to prove this. (QUESTION: Can anyone summarize the various approaches that could be taken to prove this?) One way, perhaps the best way, is to prove that: 1) if Slater's condition is satisfied, then strong duality holds and there is a dual optimal variable; 2) if strong duality holds, then the KKT conditions are satisfied by $x,z$ if and only if $x$ is primal optimal and $z$ is dual optimal. (This is the approach taken in Boyd and Vandenberghe.) For the first time in this development we have now mentioned the dual problem. I think other approaches to proving a KKT theorem in convex optimization don't use the dual problem at all.

And where did this dual problem come from? How do we motivate that? What's the best way to think about it?

One of the key ideas of convex analysis is that a closed convex set $C$ is the intersection of all closed half-spaces that contain $C$. I think this could be called the source of all duality results in convex analysis. When this idea is applied to a convex cone $K$, we discover the dual cone $K^*$ and the fact that $K^{**} = K$. When this idea is applied to the epigraph of a closed convex function $f$, we discover the convex conjugate $f^*$ and the fact that $f^{**} = f$. It's tempting to apply this idea (somehow) to a convex optimization problem. How do we associate a closed convex set $C$ to a convex optimization problem? If the problem is

\begin{align}
\text{minimize}_x & \quad f(x) \\
\text{subject to} & \quad Ax = b \\
&\quad h(x) \leq 0,
\end{align}
and $\mathcal D$ is the domain of the problem,
then we can form something like an "epigraph" for this optimization problem as follows:
\begin{equation}
C = \{ (u,v,t) \mid \exists x \in \mathcal D \text{s.t.} u = Ax – b, h(x) \leq v, f(x) \leq t \}.
\end{equation}

By thinking of the optimization problem in terms of half-spaces containing $C$ (or in terms of hyperplanes supporting $C$), we get a geometric interpretation of the dual problem and a nice way to prove a strong duality theorem. (This is the approach taken in Boyd and Vandenberghe.)

QUESTIONS: Have I explained anything here incorrectly? Is there a better way to organize or understand these ideas, or a better way to put them in perspective? Do you like this way of thinking about these ideas? Do you have any comments, even if not directly related, that you think I might find to be interesting or that might help me gain a deeper understanding of these ideas?

Thanks!

Best Answer

I believe that you should look up the Fritz John conditions. My opinion is that they are superior to the KKT conditions, in that they incorporate the rather ugly issue of the "constraint qualification" into the Lagrangean by the use of an additional multiplier -and they are able to uncover solutions to an optimization problem that under KKT may pass unnoticed. The Fritz John conditions have been stated in
"F. JOHN. Extremum problems with inequalities as side conditions. In “Studies and Essays, Courant Anniversary Volume” (K. O. Friedrichs, O. E. Neugebauer and J. J. Stoker, eds.), pp. 187-204. Wiley (Interscience), New York, 1948"

and have been generalized in

"Mangasarian, O. L., & Fromovitz, S. (1967). The Fritz John necessary optimality conditions in the presence of equality and inequality constraints. Journal of Mathematical Analysis and Applications, 17(1), 37-47.

In a simplified setting, assume we want to \begin{align} \max_x &f(x)\\ s.t. & g(x) \ge 0\\ & h(x) = 0 \end{align}

Then the Lagrangean in the case of Fritzh John conditions is formed as

$$L_{FJ} = \xi f(x) + \lambda g(x)+\mu h(x) $$ $$\lambda, \mu \ge 0,\qquad \xi \in\{0,1\},\qquad \{\xi , \lambda, \mu\}\neq \mathbf 0$$

The new element is the multiplier $\xi$ on the objective function, which takes only the values zero or unity (after normalization). If a solution necessitates that $\xi =1$, we obtain the KKT conditions with the constraint qualification satisfied. If a solution necessitates that $\xi =0$, it reflects, among other special cases, the case where the constraint qualification fails to hold.

A standard example is the case where the feasible set for $x$ has been reduced to a single point due to the constraints. Then we will find that the only solution dictates that $\xi=0$, which has an intuitive explanation: if $x$ can take one and only one value due to the constraints, then the objective function "plays no role" in the determination of $x$ and so it gets a zero multiplier.
This is not so special a case as it may appear: in our problem they may exist parameters that may vary in some range. And for some combination(s) of their values the feasible set may become a single point. If you have applied KKT conditions in such a setting, and you used only algebraic calculations, you could end up characterizing such cases as "no solution" while a solution does exist. I have seen it happen - this is why I believe that the FJ conditions are superior.