Support Vector Machines – Understanding SVMs and the Hyperplane

Tags: classification, logistic, machine-learning, separation, svm

In my project I want to create a logistic regression model for predicting a binary outcome (1 or 0).

I have 15 variables, 2 of which are categorical, while the rest are a mixture of continuous and discrete variables.

In order to fit a logistic regression model I have been advised to check for linear separability using either an SVM, a perceptron, or linear programming. This ties in with suggestions made here regarding testing for linear separability.

As a newbie to machine learning, I understand the basic concepts behind the algorithms mentioned above, but conceptually I struggle to visualise how we can separate data that has so many dimensions, i.e. 15 in my case.

The examples in online material typically show a 2D plot of two numerical variables (height, weight) with a clear gap between the categories, which makes the idea easy to grasp, but real-world data is usually of much higher dimension. I keep coming back to the Iris dataset and trying to picture fitting a hyperplane between the three species, and how it is particularly difficult, if not impossible, to do so between two of the species (which two escapes me right now).

How does one achieve this when we have even more dimensions? Is it assumed that, once we exceed a certain number of features, we use kernels to map the data to a higher-dimensional space in order to achieve separability?

Also, to test for linear separability, what metric is used? Is it the accuracy of the SVM model, i.e. the accuracy derived from the confusion matrix?
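For concreteness, is this roughly the kind of check that is meant? A rough sketch (my own guess, using scikit-learn's `LinearSVC` with a very large `C` to approximate a hard margin, and treating 100% training accuracy as the signal of separability):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def looks_linearly_separable(X, y):
    """Rough separability check: can a linear classifier reach zero training error?"""
    # A very large C approximates a hard-margin SVM: any misclassified
    # training point is penalised heavily.
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1e6, max_iter=100_000))
    clf.fit(X, y)
    # The criterion is 100% accuracy on the *training* data itself,
    # not cross-validated accuracy or a held-out confusion matrix.
    return clf.score(X, y) == 1.0

# X: (n_samples, 15) matrix with the categorical variables one-hot encoded
# y: (n_samples,) array of 0/1 labels
# print(looks_linearly_separable(X, y))
```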

Any help in better understanding this topic would be greatly appreciated. Below is a plot of two of the variables in my dataset, which shows how much just these two variables overlap.

[Scatter plot of two predictor variables from the dataset, showing heavy overlap between the two classes]

Best Answer

I'm going to try to help you gain some sense of why adding dimensions helps a linear classifier do a better job of separating two classes.

Imagine you have two continuous predictors $X_1$ and $X_2$, a sample of $n=3$ points, and a binary classification problem. Our data then looks something like this:

[Figure: $n = 3$ points in the $(X_1, X_2)$ plane]

Now imagine assigning some of the points to class 1 and some to class 2. Note that no matter how we assign classes to points we can always draw a line that perfectly separates the two classes.

But now let's say we add a new point:

[Figure: $n = 4$ points in the $(X_1, X_2)$ plane, colored by class in an XOR pattern]

Now there are assignments of these points to two classes such that no line can perfectly separate them; one such assignment is given by the coloring in the figure (this is an example of an XOR pattern, a very useful one to keep in mind when evaluating classifiers). So with $p=2$ variables a linear classifier can perfectly classify any assignment of labels to three non-collinear points, but it cannot, in general, perfectly classify four points.
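If you want to convince yourself of this numerically, here is a small sketch (my own illustration, using scikit-learn's `LinearSVC` with a very large `C` as a stand-in for a hard-margin linear classifier; a perceptron would illustrate the separable cases just as well). It checks every two-class labeling of three non-collinear points, and then the XOR labeling of four points:

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC

def separable(points, labels):
    """True if a linear classifier reaches zero training error on these points."""
    clf = LinearSVC(C=1e6, max_iter=100_000)
    clf.fit(points, labels)
    return clf.score(points, labels) == 1.0

# Three non-collinear points: every two-class labeling is linearly separable.
three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
for labels in itertools.product([0, 1], repeat=3):
    if len(set(labels)) == 2:  # skip the two single-class labelings
        assert separable(three, np.array(labels))

# Four points with the XOR labeling: no line separates the two classes.
xor_points = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
xor_labels = np.array([0, 0, 1, 1])
print(separable(xor_points, xor_labels))  # False
```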

But what happens if we now add another predictor $X_3$?

[Figure: the same $n = 4$ points with a third predictor $X_3$ added, $p = 3$]

Here lighter-shaded points are closer to the origin. It may be a little hard to see, but with $p=3$ and $n=4$ we can again perfectly classify any assignment of class labels to these points.
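To make that third figure concrete (again just a sketch, and the particular lift $X_3 = X_1 X_2$ is one convenient choice of third coordinate, not the one drawn above): appending a product term to the four XOR points makes the same labeling linearly separable.

```python
import numpy as np
from sklearn.svm import LinearSVC

# The XOR points from before, with a third coordinate X3 = X1 * X2 appended.
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 0, 1, 1])
X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

clf = LinearSVC(C=1e6, max_iter=100_000).fit(X_lifted, y)
print(clf.score(X_lifted, y))  # 1.0 -- a plane separates the classes in 3D
```

For example, the plane $x_1 + x_2 - 2x_3 = 0.5$ puts the two class-1 points on one side and the two class-0 points on the other.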

The general result: with $p$ predictors, a linear model can perfectly classify any assignment of two classes to $p+1$ points in general position.

The point of all of this is that if we keep $n$ fixed and increase $p$, we increase the number of labelings we can separate, until we reach the point where we can perfectly classify any assignment of labels. With a kernel SVM we implicitly fit a linear classifier in a very high-dimensional (for the RBF kernel, infinite-dimensional) feature space, which is why we very rarely have to worry about whether a separating hyperplane exists.
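Here is the same toy example with the kernel doing the lifting implicitly (a sketch; there is nothing special about these hyperparameter values): an SVM with an RBF kernel fits the XOR labeling perfectly even though no line in the original two dimensions can.

```python
import numpy as np
from sklearn.svm import SVC

xor_points = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
xor_labels = np.array([0, 0, 1, 1])

# The RBF kernel corresponds to an implicit infinite-dimensional feature map,
# so a "linear" separator in that space can carve out the XOR pattern.
rbf = SVC(kernel="rbf", C=1e6, gamma=1.0).fit(xor_points, xor_labels)
print(rbf.score(xor_points, xor_labels))  # 1.0
```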

For a set of possible classifiers $\mathscr F$, if for a sample of $n$ points there exist functions in $\mathscr F$ that can perfectly classify any assignment of labels to those $n$ points, we say that $\mathscr F$ can shatter $n$ points. If $\mathscr F$ is the set of all linear classifiers in $p$ variables, then $\mathscr F$ can shatter up to $n = p+1$ points. If $\mathscr F$ is the space of all measurable functions of $p$ variables, then it can shatter any number of points.

This notion of shattering, which measures the complexity of a set of possible classifiers, comes from statistical learning theory and can be used to make statements about how much a set of classifiers can overfit. If you're interested, I highly recommend von Luxburg and Schölkopf, "Statistical Learning Theory: Models, Concepts, and Results" (2008).
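In symbols, the linear case is the standard VC-dimension result for affine (linear plus intercept) classifiers:

$$
\mathscr F_{\text{lin}} = \left\{\, x \mapsto \operatorname{sign}\!\left(w^\top x + b\right) \;:\; w \in \mathbb{R}^p,\ b \in \mathbb{R} \,\right\},
\qquad
\operatorname{VC}\!\left(\mathscr F_{\text{lin}}\right) = p + 1,
$$

so the largest number of points that $\mathscr F_{\text{lin}}$ can shatter is exactly the $p+1$ from the toy example above.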