[Math] Do we maximize the Lagrangian function or minimize it for the SVM derivation? And why

lagrange-multiplier, machine-learning, optimization

These are the resources I've looked into:

  1. MIT lecture
  2. Math.stackexchange question
  3. Slide from MIT
  4. Stanford lecture
  5. Caltech lecture

The professor in (1) at 25:47 says that "I don't care if we find the maximum or minimum (of the Lagrangian function) …. we'll just find the extremum…"

The accepted answer in (2) suggests that we are minimizing the (primal) Lagrangian.

In (3), slide 14 explicitly states that we are trying to minimize the function.

In (4) at 24:30 the professor says that we just set the derivative of the Lagrangian function to zero with respect to the hyperplane parameters ($\boldsymbol{w}$ & $b$) (this, of course, is how you find an extremum).

In (5) at 33:09 the professor again says that we are minimizing the function with respect to $\boldsymbol{w}$ and $b$ but maximizing it with respect to the Lagrange multipliers.

Question:
I'm still not clear on whether the Lagrangian function, i.e.

$$\mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_i \alpha_i\left[y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)-1\right],$$

needs to be minimized. (It looks like it does; if so, why?)

What I understand is that the method of Lagrange multipliers lets you find the extrema of an objective function by setting the gradient of the Lagrangian to zero. But how do we know whether we are minimizing the function or maximizing it?

I'm also not sure why, after you set those derivatives to zero and substitute the results back into the function, the function now needs to be maximized instead of minimized.

Best Answer

The primal SVM problem is: $$\min_{\boldsymbol{w},b} \left\{\frac{1}{2}\|\boldsymbol{w}\|^2 : y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)\geq1 \ \forall i \right\}.$$ Therefore, the Lagrangian should also be minimized over $\boldsymbol{w}$ and $b$. Since the Lagrangian is convex in $\boldsymbol{w}$ and $b$ (for fixed $\boldsymbol{\alpha}$), a critical point corresponds to a global minimum.
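Concretely, with the usual sign convention $\mathcal{L} = \frac{1}{2}\|\boldsymbol{w}\|^2 - \sum_i \alpha_i\left[y_i(\boldsymbol{w}^T\boldsymbol{x}_i+b)-1\right]$, the stationarity conditions are:

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{w}} = \boldsymbol{w} - \sum_i \alpha_i y_i \boldsymbol{x}_i = \boldsymbol{0} \;\Longrightarrow\; \boldsymbol{w} = \sum_i \alpha_i y_i \boldsymbol{x}_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y_i = 0.$$

Substituting these back into $\mathcal{L}$ eliminates $\boldsymbol{w}$ and $b$ entirely, leaving a function of $\boldsymbol{\alpha}$ alone:

$$\mathcal{L}(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \boldsymbol{x}_i^T\boldsymbol{x}_j.$$

This dual objective is then maximized over $\alpha_i \geq 0$, which is where the switch from minimizing to maximizing comes from: the minimization over $\boldsymbol{w}$ and $b$ has already been carried out when you plug in the stationarity conditions.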

What is so beautiful about Lagrange duality is that if the primal problem is feasible (strictly feasible if the constraints are nonlinear; the SVM constraints are affine, so plain feasibility suffices), then: $$\min_{w,b} \max_{\alpha \geq 0} \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}) = \max_{\alpha \geq 0} \min_{w,b} \mathcal{L}(\boldsymbol{w}, b, \boldsymbol{\alpha}).$$ On the left-hand side you have the primal problem; on the right-hand side, the dual problem. This property is called strong duality, and the feasibility condition is called Slater's condition. Slater's condition guarantees strong duality only for convex problems, which the SVM problem is.
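As a quick numerical sanity check, here is a sketch (a hypothetical toy example, using `scipy.optimize.minimize` rather than a dedicated QP solver) that solves the dual for a two-point, one-dimensional dataset and verifies that the optimal dual value equals the optimal primal value:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D dataset: one point per class (hypothetical example).
X = np.array([1.0, -1.0])   # inputs
y = np.array([1.0, -1.0])   # labels

# Matrix Q_ij = y_i y_j x_i x_j appearing in the dual objective.
Q = np.outer(y * X, y * X)

def neg_dual(alpha):
    # Dual: maximize sum(alpha) - 1/2 alpha^T Q alpha; we minimize its negative.
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

res = minimize(
    neg_dual,
    x0=np.zeros(2),
    bounds=[(0, None), (0, None)],                        # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},   # sum_i alpha_i y_i = 0
    method="SLSQP",
)

alpha = res.x
w = np.sum(alpha * y * X)   # stationarity: w = sum_i alpha_i y_i x_i
primal = 0.5 * w**2         # primal objective 1/2 ||w||^2
dual = -res.fun             # optimal dual value

print(alpha, w, primal, dual)   # strong duality: primal and dual values agree
```

For this dataset the margin-defining points are at $x = \pm 1$, so the maximum-margin separator is $w = 1$, $b = 0$, and both the primal and dual optima come out equal, as strong duality predicts.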
