[Math] Why “hinge” loss is equivalent to 0-1 loss in SVM

machine learning, self-learning, statistics

I'm reading *The Elements of Statistical Learning* by Hastie et al. In $\S 12.3.2$, the SVM is introduced as a penalization method:

With $f(x)=h(x)^T \beta+\beta_0 $, the solution of the optimization problem
$$\min_{\beta_0,\beta} \sum_{i=1}^N [1-y_i f(x_i)]_+ +\frac{\lambda}{2}\|\beta\|^2 $$
with $\lambda=\frac{1}{C}$, is the same as that for
$$\min_{\beta_0,\beta} \frac{1}{2}\|\beta\|^2 +C \sum_{i=1}^N \xi_i$$
$$\text{subject to } \xi_i \geq 0,\; y_i f(x_i)\geq 1-\xi_i \;\; \forall i,$$

Could anyone kindly give me a hint as to why the two are equivalent, and what the benefit of introducing the "hinge" loss is?

Thanks a lot!

Best Answer

Here is an intuitive illustration of the difference between the hinge loss and the 0-1 loss:

[Figure: the 0-1 loss, hinge loss, and logistic loss plotted as functions of the margin $y f(x)$]

(The image is from *Pattern Recognition and Machine Learning* by Bishop.)

As you can see in the figure, the black line is the 0-1 loss, the blue line is the hinge loss, and the red line is the logistic loss. Compared with the 0-1 loss, the hinge loss is much better behaved: the 0-1 loss is non-convex, has zero gradient everywhere except at $y f(x) = 0$, and jumps discontinuously there, so it is essentially intractable to optimize directly. The hinge loss relaxes it into a convex upper bound that still penalizes misclassified points (and points inside the margin), allowing a certain degree of misclassification while keeping the computation convenient.
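As for the first part of the question, here is a short sketch of why the two formulations are equivalent (the standard argument, not a quote from the book). For fixed $\beta$ and $\beta_0$, the objective of the constrained problem is increasing in each $\xi_i$, so at the optimum each slack variable takes the smallest value allowed by its two constraints $\xi_i \geq 0$ and $\xi_i \geq 1 - y_i f(x_i)$, namely
$$\xi_i = [1 - y_i f(x_i)]_+ .$$
Substituting this into the constrained problem gives
$$\min_{\beta_0,\beta} \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^N [1-y_i f(x_i)]_+ .$$
Dividing the objective by $C>0$ (which does not change the minimizers) and writing $\lambda = \frac{1}{C}$ yields exactly the penalized form
$$\min_{\beta_0,\beta} \sum_{i=1}^N [1-y_i f(x_i)]_+ + \frac{\lambda}{2}\|\beta\|^2 .$$

And here is a minimal numerical sketch (my own illustration, not from the book or the figure; the grid of margins is arbitrary) showing that the hinge loss is a convex upper bound on the 0-1 loss:

```python
# Minimal sketch: compare the hinge loss [1 - y f(x)]_+ with the 0-1 loss
# over a range of margins y * f(x).
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)          # values of y * f(x)
hinge = np.maximum(0.0, 1.0 - margins)       # hinge loss: convex, piecewise linear
zero_one = (margins <= 0).astype(float)      # 0-1 loss: 1 on a misclassification

for m, h, z in zip(margins, hinge, zero_one):
    print(f"y*f(x) = {m:+.1f}   hinge = {h:.1f}   0-1 = {z:.0f}")

# The hinge loss dominates the 0-1 loss everywhere, so it is a convex upper
# bound, which is what makes the penalized SVM problem tractable.
assert np.all(hinge >= zero_one)
```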
