Does using a kernel function make the data linearly separable? If so, why use a soft-margin SVM?

Tags: classification, machine-learning, svm

I'm reading about SVMs and I learned that we use a kernel function so the data become linearly separable in the high-dimensional feature space. But then I also learned that they use the soft-margin idea. My question is: why use a soft margin if the data are going to be linearly separable anyway in the high-dimensional space? Or does mapping with the kernel not necessarily make the data linearly separable?

Best Answer

Does using a kernel function make the data linearly separable?

In some cases, but not others.

- The linear kernel induces a feature space that's equivalent to the original input space (up to dot-product-preserving transformations such as rotation and reflection). If the data aren't linearly separable in input space, they won't be in feature space either.
- Polynomial kernels with degree > 1 map the data nonlinearly into a higher-dimensional feature space. Data that aren't linearly separable in input space may become linearly separable in feature space (depending on the particular data and kernel), but may not in other cases.
- The RBF kernel maps the data nonlinearly into an infinite-dimensional feature space. If the kernel bandwidth is chosen small enough, the data are always linearly separable in feature space (provided no two points from different classes coincide).
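A quick empirical check of these three cases (a sketch, not part of the original answer): fit a near-hard-margin SVM with each kernel on a toy dataset that isn't linearly separable in input space and look at training accuracy. This assumes scikit-learn's `SVC`; the dataset (`make_circles`) and all parameter values are arbitrary illustrative choices.

```python
# Sketch: which kernels can linearly separate (in feature space) a dataset that
# is not linearly separable in input space? All parameter values are illustrative.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly",   {"degree": 2, "coef0": 1.0}),
                       ("rbf",    {"gamma": 10.0})]:   # large gamma = small bandwidth
    clf = SVC(kernel=kernel, C=1e6, **params)          # huge C approximates a hard margin
    clf.fit(X, y)
    print(f"{kernel:6s} training accuracy: {clf.score(X, y):.3f}")
```

On data like this the linear kernel typically cannot reach 100% training accuracy, while the quadratic and narrow-RBF kernels typically can.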

When linear separability is possible, why use a soft-margin SVM?

The input features may not contain enough information about class labels to perfectly predict them. In these cases, perfectly separating the training data would be overfitting, and would hurt generalization performance. Consider the following example, where points from one class are drawn from an isotropic Gaussian distribution, and points from the other are drawn from a surrounding, ring-shaped distribution. The optimal decision boundary is a circle through the low density region between these distributions. The data aren't truly separable because the distributions overlap, and points from each class end up on the wrong side of the optimal decision boundary.

[Figure: samples from the Gaussian and ring-shaped distributions; the optimal decision boundary is a circle through the low-density region between them]
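
The exact data-generating parameters behind the figure aren't given in the answer; a sketch of data with the same structure (an isotropic Gaussian surrounded by a ring) might look like this, with all numbers chosen arbitrarily for illustration:

```python
# Sketch: Gaussian "inner" class surrounded by a ring-shaped "outer" class.
# All distribution parameters are illustrative guesses, not those used for the figure.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Class 0: isotropic Gaussian around the origin.
X0 = rng.normal(loc=0.0, scale=1.0, size=(n, 2))

# Class 1: ring of radius ~3 with some radial spread.
angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
radii = rng.normal(loc=3.0, scale=0.4, size=n)
X1 = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
```

Because the Gaussian's tails extend past the ring's inner edge, some points from each class fall on the wrong side of the optimal circular boundary, so no classifier can separate these classes perfectly.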

As mentioned above, an RBF kernel with a small enough bandwidth makes the training data linearly separable in feature space. A hard-margin SVM using this kernel achieves perfect accuracy on the training set (background color indicates predicted class, point color indicates actual class):

[Figure: decision regions of the hard-margin RBF SVM; every training point is on the correct side of the boundary]

The hard-margin SVM maximizes the margin subject to the constraint that no training point is misclassified, and the RBF kernel ensures that this constraint can be met. However, the resulting decision boundary is completely overfit and will not generalize well to future data.
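
scikit-learn's `SVC` only implements a soft margin, but a very large `C` behaves essentially like a hard margin. Continuing the sketch above (same `X`, `y`; parameter values are again illustrative), a narrow RBF kernel with a huge `C` will typically fit the training set perfectly:

```python
# Sketch: near-hard-margin SVM with a narrow (small-bandwidth) RBF kernel.
from sklearn.svm import SVC

hard_rbf = SVC(kernel="rbf", gamma=50.0, C=1e9)     # large gamma = small bandwidth,
hard_rbf.fit(X, y)                                  # huge C ~ hard margin
print("training accuracy:", hard_rbf.score(X, y))   # typically 1.0: the training set is memorized
```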

Instead, we can use a soft-margin SVM, which allows some margin violations and misclassifications in exchange for a larger margin (the tradeoff is controlled by a regularization hyperparameter, often called C). The hope is that a larger margin will improve generalization performance. Here's the output for a soft-margin SVM with the same RBF kernel:

[Figure: decision regions of the soft-margin RBF SVM; the boundary is close to the optimal circle, at the cost of some training errors]

Despite making more errors on the training set, the decision boundary is closer to the true boundary, so the soft-margin SVM will generalize better. Further improvements could be made by tweaking the kernel.
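
For comparison, here is a sketch of the soft-margin version: a wider kernel and a moderate `C` give up some training accuracy, and cross-validated accuracy serves as a rough proxy for generalization. The specific `gamma` and `C` values are illustrative; in practice they would be tuned.

```python
# Sketch: soft-margin SVM with a wider RBF kernel and moderate C (values illustrative).
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

soft_rbf = SVC(kernel="rbf", gamma=0.5, C=1.0)
soft_rbf.fit(X, y)
print("training accuracy:", soft_rbf.score(X, y))   # usually below 1.0

# Cross-validated accuracy as a rough proxy for generalization performance.
print("cv accuracy (soft):",
      cross_val_score(SVC(kernel="rbf", gamma=0.5, C=1.0), X, y, cv=5).mean())
print("cv accuracy (hard):",
      cross_val_score(SVC(kernel="rbf", gamma=50.0, C=1e9), X, y, cv=5).mean())
```

The soft-margin model usually wins the cross-validated comparison despite making more errors on the training set, which is the point of the answer above.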