Solved – what does scaling the normal vector of a plane (/hyperplane) mean

linear-algebra · machine-learning · svm

I understand that scaling the normal vector of a plane (multiplying or dividing it by a nonzero constant) does not affect the plane itself.
But what happens when we do so? Are we zooming in or out of the space, like in a linear transformation?

Or in simpler words, what is the intuitive effect of scaling the normal vector of a plane (or a hyperplane)?

I came across the optimization problem whilst learning SVM (support vector machines) which goes like this:
minimize $w^\top w$ s.t. $\forall i, \; y_i(w^\top x_i+b) \geq 1$,
where $w$ is the normal vector of the supporting hyperplane.
So we are basically looking to minimize the squared length of the normal vector of the hyperplane ($w^\top w = \lVert w \rVert^2$) subject to the constraint.
So I'm just wondering what effect could come out of minimizing the length of the normal vector.
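To see concretely that scaling does not move the plane, here is a minimal pure-Python check (the plane, points, and scale factor are all made-up toy values): scaling the pair $(w, b)$ by a nonzero constant $c$ leaves the zero set $\{x : w \cdot x + b = 0\}$ unchanged, but rescales the value $w \cdot x + b$ at every off-plane point by $c$.

```python
def affine(w, b, x):
    """Signed value w·x + b for a point x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = (3.0, 4.0), -5.0        # toy plane 3x + 4y - 5 = 0
on_plane = (1.0, 0.5)          # 3*1 + 4*0.5 - 5 = 0
off_plane = (2.0, 2.0)         # 3*2 + 4*2 - 5 = 9

c = 10.0
w_scaled = tuple(c * wi for wi in w)   # (30, 40)
b_scaled = c * b                       # -50

print(affine(w, b, on_plane))               # 0.0 -> on the plane
print(affine(w_scaled, b_scaled, on_plane)) # 0.0 -> still on the same plane
# Off the plane, the value scales by c, but the geometric distance
# |w·x + b| / ||w|| is unchanged (both numerator and norm scale by c):
print(affine(w_scaled, b_scaled, off_plane) / affine(w, b, off_plane))  # 10.0
```

So scaling $w$ is not a zoom of the space; it only rescales the function $x \mapsto w \cdot x + b$, while the plane and all geometric distances to it stay fixed.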

Best Answer

We want to create as broad a separation (a wide "street") between positive and negative examples, or in other words, maximize the distance between the tips of the support vectors and the decision boundary (the median of the street):

[figure: the decision boundary with the "street" between the positive and negative support vectors]

If the normal vector to the decision hyperplane, $\vec w$, is normalized to $\frac{\vec w}{\Vert \vec w \Vert}$, the street width equals the dot product of $\frac{\vec w}{\Vert \vec w \Vert}$ with any vector spanning the gap between a point on the positive boundary limit and a point on the negative boundary limit ("the gutters"), $x_{+}$ and $x_{-}$.

Imposing the constraint $y_i(w^\top x_i+b) \geq 1$ will have the positive effect of maximizing the width of the street; however, $b$ is not a predetermined value, and in fact it does not appear in the final Lagrangian expression. As for $y_i \in \{-1, +1\}$, its role is simply to make the left-hand side positive for correctly classified examples on either side of the boundary.

Given these premises, we can quickly arrive at the conclusion that maximizing the width of the "street" is equivalent to minimizing the norm of $\vec w$.

Above, $x_+$ and $x_-$ are in the gutter (on the hyperplanes maximizing the separation). Therefore, for the positive example: $({\bf x_+}\cdot \color{blue}{w} + b) -1 = 0$, or ${\bf x_+}\cdot \color{blue}{w} = 1 - b$; and for the negative example: ${\bf x_-}\cdot \color{blue}{w} = -1 - b$. Subtracting the two gives $(x_+ - x_-)\cdot w = 2$. So, reformulating the width of the street:

$$\text{width}=(x_+ \,{\bf -}\, x_-) \cdot \frac{w}{\lVert w \rVert}= \frac{2}{\lVert w \rVert} \tag{the width of the street}$$

We just have to maximize the width of the separation, which amounts to maximizing $ \frac{2}{\lVert w \rVert}$, or minimizing $ \lVert w \rVert$.
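The width formula is easy to check numerically. A small pure-Python sketch, with $w$, $b$, and the gutter points chosen by hand so that the gutters are $w \cdot x + b = \pm 1$: the projected gap $(x_+ - x_-) \cdot \frac{w}{\lVert w \rVert}$ agrees with $\frac{2}{\lVert w \rVert}$ regardless of which gutter points we pick.

```python
import math

# Toy setup: decision boundary y = 0, gutters y = +1 and y = -1.
w, b = (0.0, 1.0), 0.0
x_plus, x_minus = (3.0, 1.0), (-2.0, -1.0)   # one arbitrary point on each gutter

norm_w = math.sqrt(sum(wi * wi for wi in w))
unit_w = tuple(wi / norm_w for wi in w)

# Width of the street: (x_+ - x_-) projected onto the unit normal.
diff = tuple(p - m for p, m in zip(x_plus, x_minus))
width = sum(d * u for d, u in zip(diff, unit_w))

print(width, 2.0 / norm_w)   # both 2.0: the formula width = 2/||w|| holds
```

Note that the component of $x_+ - x_-$ parallel to the boundary (here the $x$-direction) drops out of the projection, which is why any pair of gutter points works.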

This can easily be verified analytically, or with a geometry game:

[figure: geometry-game screenshot, positive (green) and negative (red) examples with movable parallel gutters]

The positive examples are green dots, and the negative examples red. Notice that by changing the slope of the parallel lines (the street gutters), we make the separation broader and broader until it reaches a maximum value; along the way, the norm (length) of the randomly chosen normal vector decreases from $\Vert\text{NormVec}\Vert=2.67$ to $\Vert\text{NormVec}\Vert=2$. The vector spanning the difference between a positive and a negative example in the gutter is $\text{DiffSV}$ (difference of support vectors), and its projection onto the direction orthogonal to the decision boundary, $\text{Proj}$, increases in magnitude as the norm of the normal vector to the decision hyperplane decreases.

This makes sense, because at the same time, the angle $\alpha$ between $\text{DiffSV}$ (difference support vector) and $\text{Proj}$ (projection) decreases, and the cosine increases. Therefore $\Vert w \Vert$ has to decrease to keep the trigonometric relation:

$$\text{Proj}=\Vert \text{DiffSV}\Vert \,\cos(\alpha)=\frac{\vec w}{\Vert \vec w\Vert}\,\cdot\,\text{DiffSV}.$$
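This trigonometric relation can be verified numerically as well. A minimal pure-Python check with an arbitrary (made-up) normal vector and difference vector: the projection computed via $\frac{\vec w}{\Vert \vec w\Vert} \cdot \text{DiffSV}$ matches $\Vert \text{DiffSV}\Vert \cos(\alpha)$.

```python
import math

w = (1.0, 2.0)        # arbitrary normal vector (toy values)
diff_sv = (3.0, 1.0)  # arbitrary "DiffSV" vector

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

proj = dot(w, diff_sv) / norm(w)                         # (w/||w||)·DiffSV
cos_alpha = dot(w, diff_sv) / (norm(w) * norm(diff_sv))  # cos of angle between them

# The two sides of the relation agree:
print(proj, norm(diff_sv) * cos_alpha)
```

With $\text{DiffSV}$ fixed by the data, a larger projection forces a larger $\cos(\alpha)$, i.e. a normal vector more nearly aligned with $\text{DiffSV}$, which is exactly the configuration reached as $\Vert w \Vert$ shrinks to its minimum.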

Here is a Geogebra toy example.