Solved – Support vector machine margin term, why norm squared

classification, least-squares, regularization, svm

For an SVM with soft margin, we want to minimize the following:
$$ \lambda||\hat w||^2 + \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\, 1-y_i(\hat w \cdot \hat x_i - b)\bigr) $$
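For concreteness, here is a minimal NumPy sketch of this objective (my own illustration; the toy names `w`, `b`, `lam` are made up and not part of the original post):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, lam):
    """lam * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w . x_i - b))."""
    margins = y * (X @ w - b)
    hinge = np.maximum(0.0, 1.0 - margins)   # zero for points on the correct side of the margin
    return lam * np.dot(w, w) + hinge.mean()
```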

We know that $2/||\hat w||$ is the width of the margin.
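For reference, that width comes from projecting onto the unit normal $\hat w/||\hat w||$: if $\hat w \cdot x_1 - b = 1$ and $\hat w \cdot x_2 - b = -1$, then $\hat w \cdot (x_1 - x_2) = 2$, so the two supporting hyperplanes are
$$ \frac{\hat w \cdot (x_1 - x_2)}{||\hat w||} = \frac{2}{||\hat w||} $$
apart.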

The second term penalizes a misclassified point according to how far it lies from the margin boundary, measured in units of the margin half-width $1/||\hat w||$. E.g., suppose there is a misclassified point $x_0$ with
$$ 1-y_0(\hat w \cdot \hat x_0 - b)=3. $$
That means $x_0$ is $3/||\hat w||$ away from the margin boundary $1-y_0(\hat w \cdot \hat x_0 - b)=0$ and is penalized by $3$.
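More generally, the distance from a point $x$ to the hyperplane $\{z : y(\hat w \cdot z - b) = 1\}$ is
$$ \frac{|1 - y(\hat w \cdot x - b)|}{||\hat w||}, $$
so a hinge value of $3$ corresponds to a geometric distance of $3/||\hat w||$, i.e. three margin half-widths.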

The first term grows with $||\hat w||^2$, i.e. with the inverse of the squared margin width. I find it hard to reconcile this with the second term; they seem to be on different scales. Is there any reason (intuitively) why $||\hat w||^2$ is used instead of just $||\hat w||$?

PS: Perhaps one reason is that $||\hat w||^2$ is easier to handle computationally (quadratic programming)? Or perhaps the squared norm corresponds to assuming Gaussian noise in the samples? I am not sure. Has anyone seen $||\hat w||$ used instead of $||\hat w||^2$?

Best Answer

As far as I know, the square is introduced in the formulation for convenience: $\|w\|$ and $\|w\|^2$ are minimized at the same points, and squaring gets rid of an ugly square root, leaving a smooth quadratic term.

With respect to the hinge-loss term, the square makes no effective difference either, because of the presence of $\lambda$. Both $\|w\|$ and $\|w\|^2$ are increasing functions of the same quantity, $\|w\|$, so they penalize the same thing on different scales, and $\lambda$ absorbs the difference. Concretely, if $w^\star$ minimizes the squared objective with weight $\lambda$, then choosing $\lambda' = 2\lambda\|w^\star\|$ makes the gradient of $\lambda'\|w\|$ at $w^\star$ equal to that of $\lambda\|w\|^2$, so $w^\star$ also satisfies the optimality condition of the non-squared objective (both objectives are convex).

That is, for any solution that you find for the squared objective, you can find exactly the same one for the non-squared objective by tweaking $\lambda$.
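To make this concrete, here is a rough numerical check (my own sketch, not part of the original answer) using NumPy/SciPy on a made-up toy dataset: minimize the squared-norm objective, retune the weight to $\lambda' = 2\lambda\|w^\star\|$, and confirm that the plain-norm objective has (approximately) the same minimizer.

```python
# Sketch only: toy data, made-up parameter values, derivative-free optimizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, (20, 2)),   # toy positive class
               rng.normal(-1.5, 1.0, (20, 2))])  # toy negative class
y = np.concatenate([np.ones(20), -np.ones(20)])

def hinge(wb):
    """Average hinge loss (1/n) * sum_i max(0, 1 - y_i * (w . x_i - b))."""
    w, b = wb[:2], wb[2]
    return np.maximum(0.0, 1.0 - y * (X @ w - b)).mean()

lam = 0.1
obj_sq = lambda wb: lam * np.dot(wb[:2], wb[:2]) + hinge(wb)     # lambda * ||w||^2 + hinge
sol_sq = minimize(obj_sq, np.zeros(3), method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})

w_star = sol_sq.x[:2]
lam2 = 2 * lam * np.linalg.norm(w_star)                          # retuned lambda'
obj_norm = lambda wb: lam2 * np.linalg.norm(wb[:2]) + hinge(wb)  # lambda' * ||w|| + hinge
sol_norm = minimize(obj_norm, np.zeros(3), method="Nelder-Mead",
                    options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})

print(sol_sq.x)    # minimizer of the squared-norm objective
print(sol_norm.x)  # should agree up to optimizer tolerance
```

Up to the optimizer's tolerance, the two printed vectors should coincide, which is the sense in which tweaking $\lambda$ recovers the same solution.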

Since the square is introduced for convenience and it makes no effective difference, I doubt you'll be able to find an intuitive reason for its being there.
