[Math] KKT multipliers and “active” and “inactive” constraints on the generalized Lagrangian $L$

Tags: karush-kuhn-tucker, lagrange-multiplier, machine-learning, multivariable-calculus, optimization

The textbook Deep Learning by Goodfellow, Bengio, and Courville says the following in a section on constrained optimization:

The inequality constraints are particularly interesting. We say that a constraint $h^{(i)}(\mathbf{x})$ is active if $h^{(i)}(\mathbf{x}^*) = 0$. If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed. It is possible that an inactive constraint excludes other solutions. For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints, or a nonconvex problem could have better local stationary points excluded by a constraint that is inactive at convergence. Yet the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive $h^{(i)}$ has negative value, the solution to $\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \ge 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$ will have $\alpha_i = 0$. We can thus observe that at the solution, $\boldsymbol{\alpha} \odot \boldsymbol{h}(\mathbf{x}) = \mathbf{0}$. In other words, for all $i$, we know that at least one of the constraints $\alpha_i \ge 0$ or $h^{(i)}(\mathbf{x}) \le 0$ must be active at the solution. To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to $\mathbf{x}$, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.
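To make the complementary-slackness claim concrete, here is a one-dimensional example of my own (not from the book): minimize $f(x) = x^2$ subject to $h(x) = 1 - x \le 0$ (i.e., $x \ge 1$). The Lagrangian is $L(x, \alpha) = x^2 + \alpha(1 - x)$; stationarity gives $2x - \alpha = 0$, and together with $\alpha \, h(x) = 0$ and feasibility this forces $x^* = 1$ with $\alpha = 2 > 0$, so the constraint is active and its multiplier is nonzero. If instead the constraint is $h(x) = x - 1 \le 0$, the unconstrained minimizer $x^* = 0$ is already feasible with $h(0) = -1 < 0$, so the constraint is inactive and $\alpha = 0$.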

Prior to this section, the authors define the aforementioned functions as follows:

To define the Lagrangian, we first need to describe $\mathbb{S}$ in terms of equations and inequalities. We want a description of $\mathbb{S}$ in terms of $m$ functions $g^{(i)}$ and $n$ functions $h^{(j)}$ so that $\mathbb{S} = \{ \mathbf{x} | \forall i, g^{(i)}(\mathbf{x}) = 0 \ \text{and} \ \forall j, h^{(j)}(\mathbf{x}) \le 0 \}$. The equations involving $g^{(i)}$ are called the equality constraints, and the inequalities involving $h^{(j)}$ are called the inequality constraints.

So it seems that the $h^{(i)}$ described in the first passage is one of these inequality constraints, specifically one that happens to be inactive at the solution.

Furthermore, when the authors say that, because an inactive $h^{(i)}$ has negative value, the solution to $\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \ge 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$ will have $\alpha_i = 0$, this follows from the preceding section, where they introduce new variables $\lambda_i$ and $\alpha_j$ for each constraint (these are called the KKT multipliers) and define the generalized Lagrangian as

$$L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\mathbf{x}) + \sum_i \lambda_i g^{(i)}(\mathbf{x}) + \sum_j \alpha_j h^{(j)}(\mathbf{x})$$
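To make sure I understand the definition, here is a minimal Python sketch (my own toy construction, not from the book) that evaluates this generalized Lagrangian on a trivial problem:

```python
import numpy as np

def lagrangian(f, gs, hs, x, lam, alpha):
    """Generalized Lagrangian: f(x) + sum_i lam_i * g_i(x) + sum_j alpha_j * h_j(x)."""
    return (f(x)
            + sum(l * g(x) for l, g in zip(lam, gs))
            + sum(a * h(x) for a, h in zip(alpha, hs)))

# Toy problem: minimize f(x) = ||x||^2 subject to h(x) = x_0 - 1 <= 0.
f = lambda x: float(np.dot(x, x))
gs, lam = [], []                 # no equality constraints
hs = [lambda x: x[0] - 1.0]      # one inequality constraint

x_star = np.zeros(2)             # unconstrained minimizer; feasible, h(x_star) = -1 < 0
print(lagrangian(f, gs, hs, x_star, lam, [0.0]))  # 0.0: with alpha = 0, L equals f(x*)
print(lagrangian(f, gs, hs, x_star, lam, [0.5]))  # -0.5: alpha > 0 only lowers L, so the
                                                  # inner max over alpha >= 0 picks alpha = 0
```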

So, evidently, if we are calculating $\min_{\mathbf{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \ge 0} L(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$, then, in order to attain the maximum over $\boldsymbol{\alpha} \ge 0$, we require $\alpha_j = 0$ whenever $h^{(j)}(\mathbf{x}) < 0$ (that is, whenever the constraint is inactive), since any $\alpha_j > 0$ would only decrease $L$.
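Spelling out the inner maximization for a single term makes this explicit (my own restatement of the book's argument):

$$\max_{\alpha_j \ge 0} \alpha_j \, h^{(j)}(\mathbf{x}) =
\begin{cases}
0 & \text{if } h^{(j)}(\mathbf{x}) < 0, \ \text{attained only at } \alpha_j = 0, \\
0 & \text{if } h^{(j)}(\mathbf{x}) = 0, \ \text{attained at every } \alpha_j \ge 0, \\
+\infty & \text{if } h^{(j)}(\mathbf{x}) > 0,
\end{cases}$$

so the outer minimization over $\mathbf{x}$ avoids the infeasible region, and every inactive constraint contributes $0$ to $L$ via $\alpha_j = 0$.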

But this section still leaves me unclear on a number of things:

  1. When the authors refer to "convergence" in this context, they're referring to convergence to the stationary point, right?
  2. Why is it the case that an "inactive" constraint potentially excludes other solutions? Is this just because the points considered during convergence are restricted to the region where $h^{(j)}(\mathbf{x}) \le 0$, so points with $h^{(j)}(\mathbf{x}) > 0$ are never considered?
  3. I'm not understanding the following explanation and the idea of how the KKT multipliers influence the solution:

To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to $\mathbf{x}$, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.

I would greatly appreciate it if people would please take the time to clarify these points.

Best Answer

Let me try: being active or inactive is not a property of the constraint itself; it depends on both the constraint and the objective function. What the authors define in that paragraph are general equality/inequality constraints.

For example, if you are minimizing $f(x,y)=x^2+y^2$ subject to $h(x,y)=x^2+y^2-1\le 0$ (i.e., over the unit disk), the minimum is achieved at $(0,0)$, where $h(0,0)=-1<0$, so the constraint is not active.

Now take $f(x,y)=(x-2)^2+y^2$ with the same constraint. The minimum is achieved on the boundary of the domain $\{(x,y):h(x,y)=x^2+y^2-1\le 0\}$, namely at $(1,0)$ ($f$ is just a shifted paraboloid), and this time the constraint is active, since $h(1,0)=0$.

Being active means that the condition for a minimum is not just the unconstrained $\nabla f(P)=0$; it also involves the constraint, and a nonzero Lagrange multiplier appears because what you have is a conditional extremum on the curve $h=0$. At such a minimum it is not true that $\nabla f=0$, but rather $\nabla f+\lambda\nabla h=0$. In other words, an inequality constraint is active for your problem when the max/min happens at a point on the boundary, where $h=0$, and then a Lagrange multiplier appears. There are as many nonzero multipliers as there are active constraints. The language used by the authors is very unclear, very typical of data science/machine learning books (many words to say something that could be explained in one clear line).
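You can check these two cases numerically; here is a quick sketch with SciPy's SLSQP solver (note SciPy's "ineq" convention is $\text{fun}(p) \ge 0$, so we pass $-h$):

```python
from scipy.optimize import minimize

# Constraint: h(x, y) = x^2 + y^2 - 1 <= 0, passed to SciPy as 1 - x^2 - y^2 >= 0.
cons = [{"type": "ineq", "fun": lambda p: 1.0 - p[0]**2 - p[1]**2}]

# Case 1: f(x, y) = x^2 + y^2. The unconstrained minimizer (0, 0) lies strictly
# inside the disk, so the constraint is inactive there (h(0, 0) = -1 < 0).
res1 = minimize(lambda p: p[0]**2 + p[1]**2, x0=[0.5, 0.5], constraints=cons)
print(res1.x)  # ~ [0, 0]

# Case 2: f(x, y) = (x - 2)^2 + y^2. The minimizer is pushed onto the boundary,
# so the constraint is active: h(1, 0) = 0, and grad f + alpha * grad h = 0
# at (1, 0) gives alpha = 1 > 0.
res2 = minimize(lambda p: (p[0] - 2.0)**2 + p[1]**2, x0=[0.5, 0.5], constraints=cons)
print(res2.x)  # ~ [1, 0]
```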
