Solved – How to choose between dual gradient descent and the method of Lagrangian multipliers

lagrange-multipliers, optimization

For an optimization problem
$$
\max_x\ f(x)\\
\text{s.t.}\quad g(x)\le 0
$$

The Lagrangian is
$$
\mathcal L(x, \lambda)=f(x)-\lambda g(x)
$$

Dual gradient descent solves it as follows (per Page 43 of this lecture; I have adapted the procedure for a maximization problem):

  1. We first find an optimal value of $x$ that maximizes $\mathcal L(x,\lambda)$, i.e., we solve $x^*\leftarrow\arg\max_x\mathcal L (x,\lambda)$
  2. Then we apply gradient descent on $\lambda$: $\lambda \leftarrow \lambda - \alpha \nabla_\lambda \mathcal L(x^*,\lambda)$
  3. Repeat the above process until convergence
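The three steps above can be sketched numerically. Here is a minimal Python sketch on a toy problem of my own choosing (the problem, step size, and iteration count are not from the lecture): maximize $f(x)=-(x-2)^2$ subject to $g(x)=x-1\le 0$, where step 1 happens to have a closed form.

```python
# Toy problem (my own example, not from the lecture):
#   maximize f(x) = -(x - 2)**2   subject to   g(x) = x - 1 <= 0
# Lagrangian: L(x, lam) = f(x) - lam * g(x)

def argmax_x(lam):
    # Step 1: maximize L over x; here it has a closed form:
    # dL/dx = -2*(x - 2) - lam = 0  =>  x = 2 - lam / 2
    return 2.0 - lam / 2.0

def grad_lam(x):
    # dL/dlam = -g(x)
    return -(x - 1.0)

lam, alpha = 0.0, 0.5
for _ in range(100):
    x = argmax_x(lam)                          # step 1
    lam = max(0.0, lam - alpha * grad_lam(x))  # step 2, keeping lam >= 0
# converges to x = 1 (the constraint is active), lam = 2
```

The unconstrained maximizer $x=2$ violates the constraint, so the multiplier grows until the iterates settle on the constraint boundary $x=1$.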

The method of Lagrange multipliers solves it by directly computing the gradients with respect to both $x$ and $\lambda$ and setting them to zero.

It seems to me that the only difference between these two algorithms is that one repeatedly performs gradient descent on $\lambda$, and the other computes $\lambda$ directly by setting $\nabla_\lambda\mathcal L(x,\lambda)$ to zero. Am I right? If that's the only difference, when should we prefer one over the other?

Best Answer

We need to satisfy the KKT conditions (first-order necessary conditions) in order to find the optimal solution.

KKT conditions for equality constraints:

$$ \text{Stationarity:}\quad \nabla_x \mathcal{L}(x,\lambda) = 0 \Rightarrow \nabla_x f(x) = \lambda \nabla_x g(x) $$

$$ \text{Primal feasibility:}\quad \nabla_{\lambda} \mathcal{L}(x,\lambda) = 0 \Rightarrow g(x)=0 $$

(there are additional conditions for inequality constraints)
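For reference, the additional KKT conditions for the inequality constraint $g(x)\le 0$ (these are the standard conditions, not specific to this lecture) are

$$ \text{Dual feasibility:}\quad \lambda \ge 0 $$

$$ \text{Complementary slackness:}\quad \lambda\, g(x) = 0 $$

The second condition says the multiplier can be non-zero only when the constraint is active, i.e. when $g(x)=0$.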

The above conditions lead to a system of (generally non-linear) equations, which is solved using root-finding algorithms such as Newton's method, typically combined with a line search. These gradient-based root-finding algorithms require derivatives of the above equations, which involve second derivatives of the objective function and first-order gradients of the constraints, i.e. $\nabla_{xx} f(x)$ and $\nabla_{x} g(x)$. This poses two problems: first, it imposes higher differentiability requirements on those functions; second, and most importantly, higher-order gradients are computationally expensive to compute. In return, however, this approach converges at a higher order than the following method.
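As an illustration (my own toy example, not from the answer), the KKT system for maximizing $f(x)=-(x-2)^2$ subject to the equality constraint $g(x)=x-1=0$ can be solved with a hand-rolled Newton iteration; note how the Jacobian contains $f''(x)$, which is exactly the second-derivative requirement mentioned above:

```python
import numpy as np

# Toy example (assumption): maximize f(x) = -(x - 2)**2  s.t.  g(x) = x - 1 = 0.
# KKT system F(x, lam) = 0, from the two conditions above:
#   stationarity:       f'(x) - lam * g'(x) = -2*(x - 2) - lam = 0
#   primal feasibility: g(x) = x - 1 = 0

def F(z):
    x, lam = z
    return np.array([-2.0 * (x - 2.0) - lam, x - 1.0])

def J(z):
    # Jacobian of F; the (0, 0) entry is f''(x) = -2, so Newton's method
    # needs second derivatives of the objective.
    return np.array([[-2.0, -1.0], [1.0, 0.0]])

z = np.array([0.0, 0.0])  # initial guess for (x, lam)
for _ in range(20):
    z = z - np.linalg.solve(J(z), F(z))  # Newton step

x_opt, lam_opt = z  # converges to x = 1, lam = 2
```

Because this toy system is linear, Newton converges in a single step; for genuinely non-linear constraints the same loop runs until $\lVert F(z)\rVert$ is small.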

Alternatively, the KKT conditions can be viewed as a saddle-point condition. Loosely speaking, we look for a value of $x$ that maximizes the Lagrangian and a value of $\lambda$ that minimizes it, so that the imposition of the constraint is kept as small as possible:

$$ \max_{x} \min_{\lambda} \mathcal{L}(x,\lambda) $$

Here, we can apply the suggested dual descent method to find the saddle/optimal point. Since this is a gradient-descent method, convergence is relatively slow, but the computational effort per iteration is also minimal.
