Solved – SVM cost function: old and new definitions

classification, data mining, loss-functions, machine learning, svm

I am trying to reconcile different definitions of the soft-margin SVM cost / loss function in primal form. There is a "max()" operator that I do not understand.

I learned about SVM many years ago from the undergraduate-level textbook "Introduction to Data Mining" by Tan, Steinbach, and Kumar, 2006. It describes the soft-margin primal form SVM cost function in Chapter 5, p. 267-268. Note that there is no mention of a max() operator.

This can be done by introducing positive-valued slack variables ($\xi$)
into the constraints of the optimization problem. …
The modified objective function is given by the following equation:

$ f(\mathbf{w}) = \frac{\left \| \mathbf{w} \right \|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i\right)^k $

where $C$ and $k$ are user-specified parameters representing the penalty
of misclassifying the training instances. For the remainder of this
section, we assume $k$ = 1 to simplify the problem. The parameter $C$ can be
chosen based on the model's performance on the validation set.

It follows that the Lagrangian for this constrained optimization
problem can be written as follows:

$ L_{p} = \frac{\left \| \mathbf{w} \right \|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i\right)^k - \sum_{i=1}^{N} \lambda_i \left(y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i\right) - \sum_{i=1}^{N} \mu_i \xi_i $

where the first two terms are the objective function to be minimized,
the third term represents the inequality constraints associated with
the slack variables, and the last term is the result of the
non-negativity requirements on the values of the $\xi_i$'s.

So that was from a 2006 textbook.
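
For reference, the inequality constraints that the multipliers $\lambda_i$ and $\mu_i$ in that Lagrangian enforce are the standard soft-margin constraints (restated here rather than quoted from the book):

$$ y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,N. $$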

Now (in 2016), I started reading more recent material on SVM. In the Stanford class on image recognition, the soft-margin primal form is stated quite differently:

Relation to Binary Support Vector Machine. You may be coming to this
class with previous experience with Binary Support Vector Machines,
where the loss for the i-th example can be written as:

[Image: the Stanford course's binary SVM per-example loss, $L_i = C \max(0, 1 - y_i w^T x_i) + R(W)$]

Similarly, on Wikipedia's article on SVM, the loss function is given as:

[Image: Wikipedia's soft-margin objective, written as the average hinge loss over the training set plus a regularization term $\lambda\left \| \mathbf{w} \right \|^2$]

Where is this "max" function coming from? Is it contained in the first two formulas in the "Introduction to Data Mining" version? How do I reconcile the old and new formulations? Is that textbook simply out-of-date?

Best Answer

The slack variable $\xi$ is defined as follows (picture from Pattern Recognition and Machine Learning).

[Figure from Pattern Recognition and Machine Learning: data points on the wrong side of the margin, annotated with their slack variables $\xi > 0$]

$\xi_i = 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)$ if $\mathbf{x}_i$ is on the wrong side of the margin (i.e., $1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b) > 0$), and $\xi_i = 0$ otherwise.

Thus $\xi_i = \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)). \qquad (1)$

So minimizing the first definition, with the slack variables given by (1), $$ f(\mathbf{w}) = \frac{\left \| \mathbf{w} \right \|^2}{2} + C\sum_{i=1}^{N} \xi_i $$ (taking $k = 1$, as the textbook does) is equivalent to minimizing the second definition (regularizer + hinge loss) $$ R(\mathbf{w}) + C\sum_{i=1}^{N} \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)). $$
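
To make the equivalence concrete, here is a minimal numerical sketch (my own illustration, not from either source; all names and the toy data are arbitrary) showing that, for any fixed $(\mathbf{w}, b)$, the slacks from (1) are exactly the hinge losses, so the two objectives coincide:

```python
import numpy as np

def slack_variables(w, b, X, y):
    """Slack xi_i per the piecewise definition above:
    xi_i = 1 - y_i (w.x_i + b) if the point violates the margin, else 0."""
    margins = y * (X @ w + b)
    return np.where(1.0 - margins > 0.0, 1.0 - margins, 0.0)

def hinge_losses(w, b, X, y):
    """Hinge loss max(0, 1 - y_i (w.x_i + b)), the 'new' formulation."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))

# Toy data and an arbitrary (w, b); labels must be in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=20) > 0, 1.0, -1.0)
w, b, C = np.array([0.5, -1.0]), 0.2, 1.0

xi = slack_variables(w, b, X, y)
hinge = hinge_losses(w, b, X, y)
assert np.allclose(xi, hinge)  # constraint (1): the slacks ARE the hinge losses

# Hence the two objectives agree (k = 1):
old_form = 0.5 * (w @ w) + C * np.sum(xi)      # ||w||^2 / 2 + C * sum(xi_i)
new_form = 0.5 * (w @ w) + C * np.sum(hinge)   # R(w) + C * sum of hinge losses
print(old_form, new_form)
```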

There's another related question that might help: Understanding the constraints of SVMs in the non-separable case?
