Support Vector Machines – Understanding SVMs and the Hyperplane

Tags: classification, logistic, machine learning, separation, svm

In my project I want to create a logistic regression model for predicting binary classification (1 or 0).

I have 15 variables, 2 of which are categorical, while the rest are a mixture of continuous and discrete variables.

In order to fit a logistic regression model I have been advised to check for linear separability using either SVM, perceptron or linear programming. This ties in with suggestions made here regarding testing for linear separability.

As a newbie to machine learning I understand the basic concepts behind the algorithms mentioned above, but conceptually I struggle to visualise how we can separate data that has so many dimensions, i.e. 15 in my case.

All the examples in the online material typically show a 2D plot of two numerical variables (height, weight) with a clear gap between the categories, which makes it easy to understand, but real-world data usually has a much higher dimension. I keep being drawn back to the Iris dataset and trying to fit a hyperplane through the three species, and how it is particularly difficult, if not impossible, to do so between two of the species (which two escapes me right now).

How does one achieve this when we have even more dimensions? Is it assumed that once we exceed a certain number of features we use kernels to map to a higher dimensional space in order to achieve this separability?

Also in order to test for linear separability what is the metric that is used? Is it the accuracy of the SVM model i.e. the accuracy based on the confusion matrix?

Any help in better understanding this topic would be greatly appreciated. Also below is a sample of a plot of two variables in my dataset which shows how overlapping just these two variables are.

Best Answer

I'm going to try to help you gain some sense of why adding dimensions helps a linear classifier do a better job of separating two classes.

Imagine you have two continuous predictors $X_1$ and $X_2$ and $n=3$ observations, and we're doing a binary classification. This means our data looks something like this:

Now imagine assigning some of the points to class 1 and some to class 2. Note that no matter how we assign classes to points we can always draw a line that perfectly separates the two classes.

But now let's say we add a new point:

Now there are assignments of these points to two classes such that a line cannot perfectly separate them; one such assignment is given by the coloring in the figure (this is an example of an XOR pattern, a very useful one to keep in mind when evaluating classifiers). So this shows us how with $p=2$ variables we can use a linear classifier to perfectly classify any three (non-collinear) points but we cannot in general perfectly classify 4 non-collinear points.

But what happens if we now add another predictor $X_3$?

Here lighter shaded points are closer to the origin. It may be a little hard to see, but now with $p=3$ and $n=4$ we again can perfectly classify any assignment of class labels to these points.
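The XOR case above can be checked numerically. The following is a minimal sketch in base R, not the method used in the answer: `is_separable` is a hypothetical helper that runs a perceptron with an epoch cap (so a `FALSE` only means no separator was found within the cap), and the third predictor $X_3 = X_1 X_2$ is an assumed choice, since the original figure does not say how $X_3$ was constructed.

```r
# Perceptron-based separability check (illustrative sketch, base R only).
# Returns TRUE if a separating hyperplane is found within max_epochs passes.
is_separable <- function(X, y, max_epochs = 1000) {
  X <- cbind(1, as.matrix(X))            # prepend a column of 1s for the intercept
  w <- rep(0, ncol(X))
  for (epoch in seq_len(max_epochs)) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      if (y[i] * sum(w * X[i, ]) <= 0) { # misclassified (or on the boundary)
        w <- w + y[i] * X[i, ]           # standard perceptron update
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) return(TRUE)      # a full pass without errors
  }
  FALSE
}

# Four points with the XOR labelling (labels coded as +1 / -1)
X2 <- rbind(c(0, 0), c(1, 1), c(0, 1), c(1, 0))
y  <- c(1, 1, -1, -1)

xor_2d <- is_separable(X2, y)            # p = 2 predictors

# Add a third predictor; X3 = X1 * X2 is an assumption made for illustration
X3 <- cbind(X2, X2[, 1] * X2[, 2])
xor_3d <- is_separable(X3, y)            # p = 3 predictors

print(c(xor_2d = xor_2d, xor_3d = xor_3d))
```

With these points the check fails in two dimensions and succeeds once the third predictor is added, which matches the argument above.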

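The general result: with $p$ predictors a linear model can perfectly classify any assignment of two classes to $p+1$ points in general position (that is, points that do not all lie on a common $(p-1)$-dimensional hyperplane).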

The point of all of this is that if we keep $n$ fixed and increase $p$ we increase the number of patterns that we can separate, until we reach the point where we can perfectly classify any assignment of labels. With kernel SVM we implicitly fit a linear classifier in a high dimensional space, so this is why we very rarely have to worry about the existence of a separation.

For a set of possible classifiers $\mathscr F$, if for a sample of $n$ points there exist functions in $\mathscr F$ that can perfectly classify any assignment of labels to these $n$ points, we say that $\mathscr F$ can shatter $n$ points. If $\mathscr F$ is the set of all linear classifiers in $p$ variables then $\mathscr F$ can shatter up to $n=p+1$ points. If $\mathscr F$ is the space of all measurable functions of $p$ variables then it can shatter any number of points. This notion of shattering, which tells us about the complexity of a set of possible classifiers, comes from statistical learning theory and can be used to make statements about the amount of overfitting that a set of classifiers can do. If you're interested in it I highly recommend von Luxburg and Schölkopf, "Statistical Learning Theory: Models, Concepts, and Results" (2008).
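For very small cases the shattering claim can be checked by brute force. The sketch below (base R) enumerates every labelling of a point set and counts how many are linearly separable. The helper names `is_separable` and `count_separable` are made up for this illustration, and the separability test is a perceptron with an epoch cap, which is an assumed stand-in for an exact test such as linear programming.

```r
# Perceptron with an epoch cap as an approximate separability test (sketch).
is_separable <- function(X, y, max_epochs = 1000) {
  X <- cbind(1, as.matrix(X))            # column of 1s for the intercept
  w <- rep(0, ncol(X))
  for (epoch in seq_len(max_epochs)) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      if (y[i] * sum(w * X[i, ]) <= 0) {
        w <- w + y[i] * X[i, ]
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) return(TRUE)
  }
  FALSE
}

# Count the separable labellings among all 2^n assignments of +1 / -1.
count_separable <- function(X) {
  n <- nrow(X)
  labelings <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))
  sum(apply(labelings, 1, function(y) is_separable(X, unname(y))))
}

three <- rbind(c(0, 0), c(1, 0), c(0, 1))   # p + 1 = 3 points in p = 2 dimensions
four  <- rbind(three, c(1, 1))              # one point more than p + 1

n_three <- count_separable(three)           # out of 2^3 = 8 labellings
n_four  <- count_separable(four)            # out of 2^4 = 16 labellings
print(c(n_three = n_three, n_four = n_four))
```

For these points all 8 labellings of the three points are separable, while only 14 of the 16 labellings of the four points are: the two XOR labellings fail, so lines in the plane do not shatter this set of four points.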

Related Solutions

Solved – Support Vector Machines (SVM) maximum margin hyperplane use

There are two common ways of utilizing the maximum-margin hyperplane of a trained SVM.

(1) Prediction for new data points

Based on a given training dataset, an SVM hyperplane is fully specified by its normal vector $w$ and an intercept $b$. (These variable names derive from a tradition established in the neural-networks literature, where the two respective quantities are referred to as 'weight' and 'bias.') A new data point $x \in \mathbb{R}^d$ can then be classified as

\begin{align} f(x) = \textrm{sgn}\left(\langle w,x \rangle + b \right) \end{align}

where $\langle w,x \rangle$ represents the inner product. Thanks to the Karush-Kuhn-Tucker complementarity conditions, the decision function can be rewritten as

\begin{align} f(x) = \textrm{sgn}\left(\sum_{i \in SV} y_i \alpha_i \langle x_i, x \rangle + b \right), \end{align}

where the hyperplane is implicitly encoded by the support vectors $x_i$ with class labels $y_i \in \{-1, +1\}$, and where $\alpha_i$ are the support vector coefficients. The support vectors are those training data points which are closest to the separating hyperplane. Thus, predictions can be made very efficiently, since only inner products (alternatively: kernel functions) between some training points and the test point have to be evaluated.
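The step from the primal to the support-vector form can be spelled out, writing the class labels $y_i \in \{-1,+1\}$ explicitly (some texts and software absorb $y_i$ into the coefficient). Substituting the expansion $w = \sum_{i \in SV} y_i \alpha_i x_i$ and using linearity of the inner product gives

\begin{align} \langle w, x \rangle + b = \Big\langle \sum_{i \in SV} y_i \alpha_i x_i ,\, x \Big\rangle + b = \sum_{i \in SV} y_i \alpha_i \langle x_i, x \rangle + b . \end{align}

For a kernel SVM, each inner product $\langle x_i, x \rangle$ is replaced by a kernel evaluation $k(x_i, x)$, where $k$ denotes whichever kernel was used in training.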

Some have suggested also considering the distance between a new data point $x$ and the hyperplane, as an indicator of how confident the model was in its prediction. However, it is important to note that hyperplane distance itself does not afford inference; there is no probability associated with a new prediction, which is why an SVM is sometimes referred to as a point classifier. If probabilistic output is desired, other classifiers may be more appropriate, e.g., the SVM's probabilistic cousin, the relevance vector machine (RVM).

(2) Reconstructing feature weights

There is another way of putting an SVM model to use. In many classification analyses it is interesting to examine which features drove the classifier, i.e., which features played the biggest role in shaping the separating hyperplane. Given a trained SVM model with a linear kernel, these feature coefficients $w_1, \ldots, w_d$ can be reconstructed easily using

\begin{align} w = \sum_{i=1}^n y_i \alpha_i x_i \end{align}

where $x_i$ and $y_i$ represent the $i^\textrm{th}$ training example and its corresponding class label.
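As a small worked check of this formula, here is a sketch in base R on a toy problem with two support vectors. The numbers are chosen by hand for illustration (they satisfy the hard-margin conditions $y_i(\langle w, x_i \rangle + b) = 1$ and $\sum_i \alpha_i y_i = 0$) and are not the output of any SVM library; the variable names are likewise made up.

```r
# Toy hard-margin problem with two support vectors (values worked out by hand).
sv    <- rbind(c(1, 1), c(-1, -1))   # support vectors x_i, one per row
y     <- c(1, -1)                    # class labels y_i
alpha <- c(0.25, 0.25)               # dual coefficients alpha_i
b     <- 0                           # intercept

# Reconstruct the feature weights: w = sum_i y_i * alpha_i * x_i
w <- colSums(y * alpha * sv)

# Score a new point in two equivalent ways
x_new  <- c(2, 0.5)
primal <- sum(w * x_new) + b                        # <w, x> + b
dual   <- sum(y * alpha * (sv %*% x_new)) + b       # sum_i y_i alpha_i <x_i, x> + b
label  <- sign(primal)                              # predicted class

# Geometric distance of x_new from the hyperplane
distance <- abs(primal) / sqrt(sum(w^2))

print(list(w = w, primal = primal, dual = dual, label = label))
```

Here `w` comes out as (0.5, 0.5), and the primal and dual scores agree (both 1.25), which is the equivalence between the weight-vector form and the support-vector form of the decision function.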

An important caveat of this approach is that the resulting feature weights are simple numerical coefficients without inferential quality; there is no measure of confidence associated with them. Thus, we cannot readily argue that some features were 'more important' than others, and we cannot infer that a feature with a particularly low coefficient was 'not important' in the classification problem. In order to allow for inference on feature weights, we would need to resort to more general-purpose approaches, such as the bootstrap, a permutation test, or a feature-selection algorithm embedded in a cross-validation scheme.

Classification – Measure for Separability in SVM and Linear Models

The most common measures of separability are based on how much the class-conditional distributions overlap (probabilistic measures). There are several of these: the Jeffries-Matusita distance, the Bhattacharyya distance and the transformed divergence. Descriptions are easy to find online, and they are quite straightforward to implement.
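For reference, these are the forms commonly used when both classes are assumed to be normally distributed, with class means $\mu_1, \mu_2$ and covariance matrices $\Sigma_1, \Sigma_2$ (symbols introduced here for illustration). Other scalings of the Jeffries-Matusita distance also appear in the literature.

\begin{align} B &= \frac{1}{8} (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \det \Sigma_2}}, \qquad \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}, \\ JM &= 2 \left( 1 - e^{-B} \right), \\ D &= \frac{1}{2} \textrm{tr}\left[ (\Sigma_1 - \Sigma_2)(\Sigma_2^{-1} - \Sigma_1^{-1}) \right] + \frac{1}{2} \textrm{tr}\left[ (\Sigma_1^{-1} + \Sigma_2^{-1})(\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top \right], \\ TD &= 2 \left( 1 - e^{-D/8} \right). \end{align}

Here $B$ is the Bhattacharyya distance, $JM$ the Jeffries-Matusita distance, $D$ the divergence and $TD$ the transformed divergence. Both $JM$ and $TD$ lie between 0 (complete overlap) and 2 (fully separated).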

There are also some based on the behavior of nearest neighbors: the separability index, which basically looks at the proportion of points whose nearest neighbour belongs to the same class, and the hypothesis margin, which looks at the distance from an object to its nearest neighbour of the same class (near-hit) and to its nearest neighbour of the opposing class (near-miss), and then creates a measure by summing over these.
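The two nearest-neighbour measures can be sketched in a few lines of base R. This is an illustrative implementation under stated assumptions, not a reference one: `nn_separability` is a made-up name, distances are Euclidean, and the hypothesis margin of a point is taken as half the difference between its near-miss and near-hit distances, summed over all points.

```r
# Nearest-neighbour separability measures (illustrative sketch, base R only).
# X: numeric matrix (rows = observations), y: vector of class labels.
nn_separability <- function(X, y) {
  D <- as.matrix(dist(X))        # pairwise Euclidean distances
  diag(D) <- Inf                 # a point is not its own neighbour
  n <- nrow(D)
  same_nn <- logical(n)
  margin  <- numeric(n)
  for (i in seq_len(n)) {
    same      <- y == y[i]
    near_hit  <- min(D[i, same])     # nearest point of the same class
    near_miss <- min(D[i, !same])    # nearest point of the other class
    same_nn[i] <- near_hit < near_miss
    margin[i]  <- 0.5 * (near_miss - near_hit)
  }
  list(separability_index = mean(same_nn),   # share of points with a same-class nearest neighbour
       hypothesis_margin  = sum(margin))     # summed per-point margins
}

# Two well-separated clusters
X_far <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
y_far <- c("a", "a", "a", "b", "b", "b")
res_far <- nn_separability(X_far, y_far)

# Two classes interleaved along a line
X_mix <- matrix(c(0, 2, 4, 1, 3, 5), ncol = 1)
y_mix <- c("a", "a", "a", "b", "b", "b")
res_mix <- nn_separability(X_mix, y_mix)

print(res_far)
print(res_mix)
```

In the separated example the index is 1 and the summed margin is positive; in the interleaved example every nearest neighbour belongs to the other class, so the index is 0 and the summed margin is negative.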

And then you also have things like class scatter matrices and collective entropy.

EDIT

Probabilistic separability measures in R

separability.measures <- function ( Vector.1 , Vector.2 ) {
  # convert vectors to matrices in case they are not
  # (rows are observations, columns are variables)
  Matrix.1 <- as.matrix ( Vector.1 )
  Matrix.2 <- as.matrix ( Vector.2 )
  # define means (one per column)
  mean.Matrix.1 <- colMeans ( Matrix.1 )
  mean.Matrix.2 <- colMeans ( Matrix.2 )
  # define difference of means
  mean.difference <- mean.Matrix.1 - mean.Matrix.2
  # define covariances for supplied matrices
  cv.Matrix.1 <- cov ( Matrix.1 )
  cv.Matrix.2 <- cov ( Matrix.2 )
  # define the halfsum of cv's as "p"
  p <- ( cv.Matrix.1 + cv.Matrix.2 ) / 2
  # --%<------------------------------------------------------------------------
  # calculate the Bhattacharyya index
  # (solve() is the matrix inverse and %*% the matrix product; for a single
  #  variable these reduce to ordinary division and multiplication)
  bh.distance <- as.numeric (
    0.125 * t ( mean.difference ) %*% solve ( p ) %*% mean.difference +
    0.5 * log ( det ( p ) / sqrt ( det ( cv.Matrix.1 ) * det ( cv.Matrix.2 ) ) )
  )
  # --%<------------------------------------------------------------------------
  # calculate Jeffries-Matusita
  # following formula is bound between 0 and 2.0
  jm.distance <- 2 * ( 1 - exp ( -bh.distance ) )
  # also found in the bibliography:
  # jm.distance <- 1000 * sqrt (   2 * ( 1 - exp ( -bh.distance ) )   )
  # the latter formula is bound between 0 and 1414.0
  # --%<------------------------------------------------------------------------
  # calculate the divergence
  # trace (the sum of the diagonal elements) of a square matrix
  trace.of.matrix <- function ( SquareMatrix ) {
    sum ( diag ( SquareMatrix ) )
  }
  # inverse covariance matrices
  inv.cv.1 <- solve ( cv.Matrix.1 )
  inv.cv.2 <- solve ( cv.Matrix.2 )
  # term 1
  divergence.term.1 <- 1/2 * trace.of.matrix (
    ( cv.Matrix.1 - cv.Matrix.2 ) %*% ( inv.cv.2 - inv.cv.1 )
  )
  # term 2
  divergence.term.2 <- 1/2 * trace.of.matrix (
    ( inv.cv.1 + inv.cv.2 ) %*% mean.difference %*% t ( mean.difference )
  )
  # divergence
  divergence <- divergence.term.1 + divergence.term.2
  # --%<------------------------------------------------------------------------
  # and the transformed divergence
  transformed.divergence <- 2 * ( 1 - exp ( - ( divergence / 8 ) ) )
  indices <- data.frame (
    jm = jm.distance , bh = bh.distance , div = divergence , tdiv = transformed.divergence )
  return ( indices )
}

And some silly reproducible examples:

##### EXAMPLE 1
# two samples
sample.1 <- c (1362, 1411, 1457, 1735, 1621, 1621, 1791, 1863, 1863, 1838)
sample.2 <- c (1362, 1411, 1457, 10030, 1621, 1621, 1791, 1863, 1863, 1838)

# separability between these two samples
separability.measures ( sample.1 , sample.2 )

##### EXAMPLE 2
# parameters for a normal distribution
meen <- 0.2
sdevn <- 2
set.seed(1) # make the random draws reproducible
# two samples from two normal distributions
normal1 <- rnorm(5000, mean=0, sd=1) # standard normal
normal2 <- rnorm(5000, mean=meen, sd=sdevn) # normal with the parameters selected above

# separability between these two normal distributions
separability.measures ( normal1 , normal2 )

Note that these measures compare only two classes at a time, and the examples here use one variable at a time. They also sometimes carry assumptions (like the classes following a normal distribution), so you should read about them thoroughly before using them. But they still might suit your needs.
