Correct way to interpret notation used in optimization problem

karush-kuhn-tucker, lagrange-multiplier, notation, optimization

My textbook, Deep Learning by Goodfellow, Bengio, and Courville, says the following in a section on constrained optimization:

The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe $\mathbb{S}$ in terms of equations and inequalities. We want a description of $\mathbb{S}$ in terms of $m$ functions $g^{(i)}$ and $n$ functions $h^{(j)}$ so that $\mathbb{S} = \{ \boldsymbol{x} \mid \forall i, g^{(i)}(\boldsymbol{x}) = 0 \ \text{and} \ \forall j, h^{(j)}(\boldsymbol{x}) \le 0 \}$. The equations involving $g^{(i)}$ are called the equality constraints, and the inequalities involving $h^{(j)}$ are called inequality constraints.

We introduce new variables $\lambda_i$ and $\alpha_j$ for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

$$L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\boldsymbol{x}) + \sum_i \lambda_i g^{(i)}(\boldsymbol{x}) + \sum_j \alpha_j h^{(j)}(\boldsymbol{x}) \tag{4.14}$$
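As a concrete illustration (my own toy instance, not part of the quoted text): to minimize $f(x) = x^2$ subject to $x \ge 1$, write the constraint as $h^{(1)}(x) = 1 - x \le 0$; there are no equality constraints, so the generalized Lagrangian is

$$L(x, \alpha_1) = x^2 + \alpha_1(1 - x).$$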

We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. As long as at least one feasible point exists and $f(\boldsymbol{x})$ is not permitted to have value $\infty$, then

$$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) \tag{4.15}$$

has the same optimal objective function value and set of optimal points $\boldsymbol{x}$ as

$$\min_{\boldsymbol{x} \in \mathbb{S}} f(\boldsymbol{x}). \tag{4.16}$$

This follows because any time the constraints are satisfied,

$$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\boldsymbol{x}),$$

while any time a constraint is violated,

$$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = \infty.$$

These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged.
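The step the quoted passage leaves implicit is that, for a fixed $\boldsymbol{x}$, each multiplier can be maximized over independently, since it multiplies only its own constraint term:

$$\max_{\lambda_i} \lambda_i\, g^{(i)}(\boldsymbol{x}) = \begin{cases} 0 & \text{if } g^{(i)}(\boldsymbol{x}) = 0, \\ \infty & \text{otherwise}, \end{cases} \qquad \max_{\alpha_j \ge 0} \alpha_j\, h^{(j)}(\boldsymbol{x}) = \begin{cases} 0 & \text{if } h^{(j)}(\boldsymbol{x}) \le 0, \\ \infty & \text{if } h^{(j)}(\boldsymbol{x}) > 0. \end{cases}$$

If $\boldsymbol{x}$ is feasible, every constraint term is driven to $0$ and the maximum is just $f(\boldsymbol{x})$; if any constraint is violated, sending the corresponding multiplier to $\pm\infty$ makes $L$ arbitrarily large.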

My question concerns the notation used in (4.15):

$$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$$

What specifically is meant by $\min\limits_{\boldsymbol{x}} \max\limits_{\boldsymbol{\lambda}} \max\limits_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0}$? Does it mean that we first find the $\boldsymbol{\alpha}\ge 0$ that maximizes $L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$, then the $\boldsymbol{\lambda}$ that maximizes $L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$, and then the $\boldsymbol{x}$ that minimizes $L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$? Or does it mean that we find the $\boldsymbol{\alpha}\ge 0$ that maximizes only the term of $L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$ that contains $\boldsymbol{\alpha}$, then the $\boldsymbol{\lambda}$ that maximizes only the term that contains $\boldsymbol{\lambda}$, and then the $\boldsymbol{x}$ that minimizes only the term that contains $\boldsymbol{x}$? What is the correct way to interpret this notation in general?

I would greatly appreciate it if people could please take the time to clarify this.

EDIT:

Please see my response, in the comments, to Lonza Leggiera. If my first interpretation is correct and, as Lonza Leggiera says, we maximize/minimize as a function of the other variables, then, in general (not necessarily just for this optimization problem in particular), couldn't the value we select for the first step/variable as a maximizer/minimizer change depending on what is selected for the other variables? This is one of the aspects that I found confusing in my train of thought.

Best Answer

$$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) \tag{4.15}$$

Define $$F(\boldsymbol{x}) = \max_{\boldsymbol{\lambda}} G(\boldsymbol{x}, \boldsymbol{\lambda})$$ with $$G(\boldsymbol{x}, \boldsymbol{\lambda}) = \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$$ (capital letters to avoid a clash with the objective $f$). Then $$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = \min_{\boldsymbol{x}} F(\boldsymbol{x}).$$

In other words, the notation is read from the inside out, so your first interpretation is the correct one: for each fixed $\boldsymbol{x}$ and $\boldsymbol{\lambda}$, the innermost maximization over $\boldsymbol{\alpha} \ge 0$ is applied to the whole function $L$ (not just the term containing $\boldsymbol{\alpha}$), producing $G(\boldsymbol{x}, \boldsymbol{\lambda})$; maximizing that over $\boldsymbol{\lambda}$ produces $F(\boldsymbol{x})$; and finally $F$ is minimized over $\boldsymbol{x}$. In particular, the maximizing $\boldsymbol{\alpha}$ is in general a function of $\boldsymbol{x}$ and $\boldsymbol{\lambda}$, which is exactly the dependence raised in the edit.
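A minimal numerical sketch of this inside-out reading, using the toy problem from the question ($\min x^2$ subject to $x \ge 1$, so a single multiplier $\alpha$ and no $\boldsymbol{\lambda}$); the finite cap `BIG` standing in for $\infty$ and the coarse grid over $x$ are illustrative assumptions, not part of the book's method:

```python
import numpy as np

# Toy problem (illustrative only, not from the book):
# minimize f(x) = x^2  subject to  h(x) = 1 - x <= 0  (i.e. x >= 1).
f = lambda x: x**2
h = lambda x: 1.0 - x

# Generalized Lagrangian; there are no equality constraints here,
# so the lambda terms are absent.
L = lambda x, alpha: f(x) + alpha * h(x)

BIG = 1e6                      # finite stand-in for alpha -> infinity
alphas = [0.0, BIG]            # L is affine in alpha, so its max over
                               # [0, BIG] is attained at an endpoint
xs = np.linspace(-2.0, 3.0, 501)

# Inner step: for each *fixed* x, maximize L over alpha >= 0.  The result
# is a function of x alone: f(x) where h(x) <= 0, and roughly BIG * h(x)
# (approximating +infinity) where the constraint is violated.
inner = np.array([max(L(x, a) for a in alphas) for x in xs])

# Outer step: minimize that function of x alone.
x_star = xs[inner.argmin()]
print(x_star)   # -> 1.0, the constrained minimizer of x^2 on x >= 1
```

Note the order of operations in the sketch: the inner maximum is resolved separately for every candidate $x$, so the maximizing $\alpha$ depends on $x$; only afterwards is the resulting function of $x$ minimized.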
