SVM – Understanding the Loss Function of Hard Margin Support Vector Machines

loss-functions, svm

People say that the soft margin SVM uses the hinge loss function $\max(0,1-y_i(w^\intercal x_i+b))$. However, the actual objective function that the soft margin SVM minimizes is
$$
\frac{1}{2}\|w\|^2+C\sum_i\max(0,1-y_i(w^\intercal x_i+b))
$$
Some authors call the $\|w\|^2$ term the regularizer and the $\max(0,1-y_i(w^\intercal x_i+b))$ term the loss function.
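
For concreteness, here is a minimal NumPy sketch of that objective; the function and variable names are my own, not taken from any particular library:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # X: (n, d) feature matrix, y: (n,) labels in {-1, +1},
    # w: (d,) weight vector, b: scalar bias, C: trade-off parameter.
    margins = y * (X @ w + b)                # y_i (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # max(0, 1 - y_i (w^T x_i + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()
```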

However, for hard margin SVM, the whole objective function is just
$$
\frac{1}{2}\|w\|^2
$$
Does that mean the hard margin SVM only minimizes a regularizer, without any loss function? That sounds very strange.

Well, if $\frac{1}{2}\|w\|^2$ is the loss function in this case, can we call it a quadratic loss function? If so, why does the loss function of the hard margin SVM become the regularizer in the soft margin SVM, with the loss changing from quadratic to hinge?

Best Answer

The hinge loss term $\sum_i\max(0,1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b))$ in the soft margin SVM penalizes margin violations, i.e. points that are misclassified or fall inside the margin. In the hard margin SVM there are, by definition, no such violations.
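
To make the "by definition" part precise, the hard margin problem is usually stated as a constrained program:
$$
\min_{\mathbf{w},b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\intercal \mathbf{x}_i+b)\ge 1 \quad \text{for all } i.
$$
Every feasible $(\mathbf{w},b)$ satisfies $1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b)\le 0$, so each hinge term $\max(0,1-y_i(\mathbf{w}^\intercal \mathbf{x}_i+b))$ is exactly zero. The loss has not disappeared; it has been moved into the constraints.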

This indeed means that hard margin SVM tries to minimize $\|\mathbf{w}\|^2$. Due to the formulation of the SVM problem, the margin is $2/\|\mathbf{w}\|$. As such, minimizing the norm of $\mathbf{w}$ is geometrically equivalent to maximizing the margin. Exactly what we want!
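
A one-line way to see where $2/\|\mathbf{w}\|$ comes from: the margin is the distance between the two supporting hyperplanes $\mathbf{w}^\intercal \mathbf{x}+b=+1$ and $\mathbf{w}^\intercal \mathbf{x}+b=-1$, and the distance between two parallel hyperplanes that differ only in their offset is the offset difference divided by $\|\mathbf{w}\|$:
$$
\frac{|(+1)-(-1)|}{\|\mathbf{w}\|}=\frac{2}{\|\mathbf{w}\|}.
$$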

Regularization is a technique to avoid overfitting by penalizing large coefficients in the solution vector. In hard margin SVM $\|\mathbf{w}\|^2$ is both the loss function and an $L_2$ regularizer.

In the soft margin SVM, the hinge loss term also acts like a regularizer, but on the slack variables instead of $\mathbf{w}$ and in the $L_1$ sense rather than $L_2$. $L_1$ regularization induces sparsity, which is why the standard SVM is sparse in terms of support vectors (in contrast to the least-squares SVM).
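
A quick way to see that sparsity in practice is to fit a linear SVM on separable data and count the support vectors. This is only an illustrative sketch using scikit-learn's `SVC`; the data and parameter choices are arbitrary:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters, so a large C behaves much like a hard margin SVM.
X, y = make_blobs(n_samples=200, centers=[[-3, -3], [3, 3]],
                  cluster_std=0.8, random_state=0)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# Only the points on or inside the margin become support vectors;
# all other training points have zero dual coefficients.
print(f"support vectors: {len(clf.support_)} of {len(X)} training points")
```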