There are two common ways of utilizing the maximum-margin hyperplane of a trained SVM.
(1) Prediction for new data points
Based on a given training dataset, an SVM hyperplane is fully specified by its normal vector $w$ and an intercept $b$. (These variable names derive from a tradition established in the neural-networks literature, where the two respective quantities are referred to as 'weight' and 'bias.') As noted before, a new data point $x \in \mathbb{R}^d$ can then be classified as
\begin{align}
f(x) = \textrm{sgn}\left(\langle w,x \rangle + b \right)
\end{align}
where $\langle w,x \rangle$ represents the inner product. Thanks to the Karush-Kuhn-Tucker complementarity conditions, this decision function can be rewritten as
\begin{align}
f(x) = \textrm{sgn}\left( \sum_{i \in SV} y_i \alpha_i \langle x_i, x \rangle + b \right),
\end{align}
where the hyperplane is implicitly encoded by the support vectors $x_i$ with class labels $y_i \in \{-1,+1\}$, and where $\alpha_i$ are the support vector coefficients. The support vectors are those training data points with $\alpha_i > 0$; they lie closest to the separating hyperplane (on the margin or, in the soft-margin case, inside it). Thus, predictions can be made very efficiently, since only inner products (alternatively: kernel functions) between the support vectors and the test point have to be evaluated.
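As a minimal sketch of this expansion, consider a hypothetical two-point toy problem whose hard-margin solution can be worked out by hand. The data, the values in `alpha`, and the helper names `decision_value` and `predict_sv` are illustrative assumptions, not output from a real solver. The labels $y_i$ are written out explicitly; some implementations store the product $y_i \alpha_i$ as a single coefficient instead.

```r
# Toy training set: one point per class (hypothetical data).
sv    <- rbind(c(1, 1), c(-1, -1))  # support vectors x_i, one per row
y     <- c(1, -1)                   # class labels y_i
alpha <- c(0.25, 0.25)              # dual coefficients, derived by hand
b     <- 0                          # intercept

# Linear kernel; another kernel function could be passed in instead.
linear_kernel <- function(u, v) sum(u * v)

# Decision value: sum_i y_i * alpha_i * k(x_i, x) + b
decision_value <- function(x, kernel = linear_kernel) {
  k <- apply(sv, 1, kernel, v = x)  # kernel between each support vector and x
  sum(y * alpha * k) + b
}

# Predicted class label is the sign of the decision value.
predict_sv <- function(x) sign(decision_value(x))

decision_value(c(1, 1))   # 1: this support vector lies exactly on the margin
predict_sv(c(2, 0.5))     # 1
predict_sv(c(-3, 1))      # -1
```

Only the two support vectors enter the sum, however many other training points there were.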
Some have suggested also considering the distance between a new data point $x$ and the hyperplane, as an indicator of how confident the model was in its prediction. However, it is important to note that hyperplane distance itself does not afford inference; there is no probability associated with a new prediction, which is why an SVM is sometimes referred to as a point classifier. If probabilistic output is desired, other classifiers may be more appropriate, e.g., the SVM's probabilistic cousin, the relevance vector machine (RVM).
(2) Reconstructing feature weights
There is another way of putting an SVM model to use. In many classification analyses it is interesting to examine which features drove the classifier, i.e., which features played the biggest role in shaping the separating hyperplane. Given a trained SVM model with a linear kernel, these feature coefficients $w_1, \ldots, w_d$ can be reconstructed easily using
\begin{align}
w = \sum_{i=1}^n y_i \alpha_i x_i
\end{align}
where $x_i$ and $y_i$ represent the $i^\textrm{th}$ training example and its corresponding class label.
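A minimal sketch of this reconstruction, assuming a hypothetical toy dataset whose hard-margin dual solution is known by hand (the values in `X`, `y`, `alpha` and `b` are illustrative, not solver output):

```r
# Hypothetical training data: rows are examples, columns are features.
X     <- rbind(c(1, 1), c(-1, -1), c(3, 2), c(-2, -4))
y     <- c(1, -1, 1, -1)
alpha <- c(0.25, 0.25, 0, 0)  # non-support vectors have alpha = 0
b     <- 0

# w = sum_i y_i * alpha_i * x_i, written as a matrix product
w <- drop(t(X) %*% (y * alpha))
w                  # 0.5 0.5

# The explicit weights reproduce the decision values of the expansion.
drop(X %*% w + b)  # 1.0 -1.0 2.5 -3.0
```

Training points with $\alpha_i = 0$ drop out of the sum, so only the support vectors contribute to $w$.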
An important caveat of this approach is that the resulting feature weights are simple numerical coefficients without inferential quality; there is no measure of confidence associated with them. Thus, we cannot readily argue that some features were 'more important' than others, and we cannot infer that a feature with a particularly low coefficient was 'not important' in the classification problem. In order to allow for inference on feature weights, we would need to resort to more general-purpose approaches, such as the bootstrap, a permutation test, or a feature-selection algorithm embedded in a cross-validation scheme.
The most common measures of separability are based on how much the class-conditional distributions overlap (probabilistic measures). There are several of these: the Jeffries-Matusita distance, the Bhattacharyya distance, and the transformed divergence. Descriptions are easy to find online, and they are quite straightforward to implement.
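For reference, under the usual assumption that each class is Gaussian with mean $\mu_k$ and covariance $\Sigma_k$ ($k = 1, 2$), these measures are commonly written as follows. The symbols are introduced here for illustration, and the Jeffries-Matusita scaling shown is the one bounded between 0 and 2; other scalings exist.
\begin{align}
B &= \frac{1}{8} (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \, \det \Sigma_2}}, \qquad \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}, \\
JM &= 2 \left( 1 - e^{-B} \right), \\
D &= \frac{1}{2} \textrm{tr}\left[ (\Sigma_1 - \Sigma_2)(\Sigma_2^{-1} - \Sigma_1^{-1}) \right] + \frac{1}{2} \textrm{tr}\left[ (\Sigma_1^{-1} + \Sigma_2^{-1})(\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top \right], \\
TD &= 2 \left( 1 - e^{-D/8} \right).
\end{align}
Here $B$ is the Bhattacharyya distance, $D$ the divergence, and $TD$ the transformed divergence.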
There are also some based on the behavior of nearest neighbors: the separability index, which basically looks at the proportion of points whose nearest neighbour belongs to the same class, and the hypothesis margin, which looks at the distance from an object to its nearest neighbour of the same class (near-hit) and to its nearest neighbour of the opposing class (near-miss), and then creates a measure by summing over these.
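A rough sketch of both nearest-neighbour measures in base R, on hypothetical toy data. The factor of one half in the hypothesis margin is one common convention and is an assumption here, as are all variable names.

```r
# Hypothetical toy data: two classes in two dimensions.
X <- rbind(c(0, 0), c(0, 1), c(1, 0),   # class "a"
           c(3, 3), c(3, 4), c(4, 3))   # class "b"
labels <- c("a", "a", "a", "b", "b", "b")

D <- as.matrix(dist(X))  # pairwise Euclidean distances
diag(D) <- Inf           # a point is not its own neighbour

# TRUE where two points share a class label
same <- outer(labels, labels, "==")

# Separability index: share of points whose nearest neighbour has the same label.
nn <- apply(D, 1, which.min)
separability_index <- mean(labels[nn] == labels)

# Hypothesis margin per point: (distance to near-miss - distance to near-hit) / 2,
# summed over all points.
near_hit  <- apply(ifelse(same,  D, Inf), 1, min)
near_miss <- apply(ifelse(!same, D, Inf), 1, min)
hypothesis_margin <- sum((near_miss - near_hit) / 2)

separability_index  # 1 for these well-separated classes
hypothesis_margin   # positive when near-misses are farther away than near-hits
```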
And then you also have things like class scatter matrices and collective entropy.
EDIT
Probabilistic separability measures in R
separability.measures <- function ( Vector.1 , Vector.2 ) {
  # convert vectors to matrices in case they are not
  # (rows are observations, columns are variables)
  Matrix.1 <- as.matrix ( Vector.1 )
  Matrix.2 <- as.matrix ( Vector.2 )
  # define means (one per variable)
  mean.Matrix.1 <- colMeans ( Matrix.1 )
  mean.Matrix.2 <- colMeans ( Matrix.2 )
  # define difference of means
  mean.difference <- mean.Matrix.1 - mean.Matrix.2
  # define covariances for supplied matrices, and their inverses
  cv.Matrix.1 <- cov ( Matrix.1 )
  cv.Matrix.2 <- cov ( Matrix.2 )
  inv.cv.Matrix.1 <- solve ( cv.Matrix.1 )
  inv.cv.Matrix.2 <- solve ( cv.Matrix.2 )
  # define the halfsum of cv's as "p"
  p <- ( cv.Matrix.1 + cv.Matrix.2 ) / 2
  # --%<------------------------------------------------------------------------
  # calculate the Bhattacharyya distance
  # (matrix product %*% and matrix inverse solve(), not elementwise * and ^)
  bh.distance <- drop ( 0.125 * t ( mean.difference ) %*% solve ( p ) %*% mean.difference ) +
    0.5 * log ( det ( p ) / sqrt ( det ( cv.Matrix.1 ) * det ( cv.Matrix.2 ) ) )
  # --%<------------------------------------------------------------------------
  # calculate Jeffries-Matusita
  # following formula is bound between 0 and 2.0
  jm.distance <- 2 * ( 1 - exp ( -bh.distance ) )
  # also found in the bibliography:
  # jm.distance <- 1000 * sqrt ( 2 * ( 1 - exp ( -bh.distance ) ) )
  # the latter formula is bound between 0 and 1414.0
  # --%<------------------------------------------------------------------------
  # calculate the divergence
  # trace (the sum of the diagonal elements) of a square matrix
  trace.of.matrix <- function ( SquareMatrix ) {
    sum ( diag ( SquareMatrix ) )
  }
  # term 1
  divergence.term.1 <- 1/2 * trace.of.matrix (
    ( cv.Matrix.1 - cv.Matrix.2 ) %*% ( inv.cv.Matrix.2 - inv.cv.Matrix.1 )
  )
  # term 2
  divergence.term.2 <- 1/2 * trace.of.matrix (
    ( inv.cv.Matrix.1 + inv.cv.Matrix.2 ) %*% mean.difference %*% t ( mean.difference )
  )
  # divergence
  divergence <- divergence.term.1 + divergence.term.2
  # --%<------------------------------------------------------------------------
  # and the transformed divergence
  transformed.divergence <- 2 * ( 1 - exp ( - ( divergence / 8 ) ) )
  indices <- data.frame (
    jm = jm.distance , bh = bh.distance , div = divergence , tdiv = transformed.divergence )
  return ( indices )
}
And some silly reproducible examples:
##### EXAMPLE 1
# two samples
sample.1 <- c (1362, 1411, 1457, 1735, 1621, 1621, 1791, 1863, 1863, 1838)
sample.2 <- c (1362, 1411, 1457, 10030, 1621, 1621, 1791, 1863, 1863, 1838)
# separability between these two samples
separability.measures ( sample.1 , sample.2 )
##### EXAMPLE 2
# parameters for a normal distribution
meen <- 0.2
sdevn <- 2
n <- 5000
set.seed(1) # make the random draws reproducible
# two samples from two normal distributions
# (rnorm draws samples; dnorm would only evaluate the density curve)
normal1 <- rnorm(n,mean=0,sd=1) # standard normal
normal2 <- rnorm(n,mean=meen, sd=sdevn) # normal with the parameters selected above
# separability between these two normal distributions
separability.measures ( normal1 , normal2 )
Note that these measures compare only two classes at a time, that the examples above use one variable at a time, and that they rest on some assumptions (such as the classes following a normal distribution), so you should read up on them thoroughly before using them. But they still might suit your needs.
I'm going to try to help you gain some sense of why adding dimensions helps a linear classifier do a better job of separating two classes.
Imagine you have two continuous predictors $X_1$ and $X_2$ and $n=3$, and we're doing a binary classification. This means our data looks something like this:
Now imagine assigning some of the points to class 1 and some to class 2. Note that no matter how we assign classes to points we can always draw a line that perfectly separates the two classes.
But now let's say we add a new point:
Now there are assignments of these points to two classes such that a line cannot perfectly separate them; one such assignment is given by the coloring in the figure (this is an example of an XOR pattern, a very useful one to keep in mind when evaluating classifiers). So this shows us how with $p=2$ variables we can use a linear classifier to perfectly classify any three (non-collinear) points but we cannot in general perfectly classify 4 non-collinear points.
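To see why no line works for the XOR pattern, place the four points (as an illustrative assumption about the figure) at the corners of the unit square, with $(0,0)$ and $(1,1)$ in one class and $(1,0)$ and $(0,1)$ in the other. A separating line $w_1 X_1 + w_2 X_2 + b = 0$ that puts the first class on its negative side would need
\begin{align}
b &< 0, & w_1 + w_2 + b &< 0, \\
w_1 + b &> 0, & w_2 + b &> 0.
\end{align}
Adding the two inequalities in the first row gives $w_1 + w_2 + 2b < 0$, while adding the two in the second row gives $w_1 + w_2 + 2b > 0$, a contradiction. Putting the first class on the positive side instead reverses all four inequalities and leads to the same contradiction.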
But what happens if we now add another predictor $X_3$?
Here lighter shaded points are closer to the origin. It may be a little hard to see, but now with $p=3$ and $n=4$ we again can perfectly classify any assignment of class labels to these points.
The general result: with $p$ predictors a linear model can perfectly classify any assignment of two classes to $p+1$ points, provided the points are in general position (for $p=2$, not all on one line).
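This can be checked numerically with a small sketch. For $p+1$ points in general position the augmented matrix $[x, 1]$ is square and invertible, so for any labelling $y$ we can solve exactly for $(w, b)$. The helper name `shatters` and the example points are assumptions made for illustration, and this is not a general test of linear separability: it stops with an error if the points are not in general position.

```r
# Returns TRUE if every labelling of the rows of X is reproduced exactly by
# sign(X w + b). Assumes nrow(X) == ncol(X) + 1 and points in general position.
shatters <- function(X) {
  n <- nrow(X)
  A <- cbind(X, 1)  # augmented design matrix [x, 1]
  # all 2^n assignments of the labels -1 / +1
  labelings <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))
  all(apply(labelings, 1, function(y) {
    wb <- solve(A, y)  # exact solution of A [w; b] = y
    all(sign(drop(A %*% wb)) == y)
  }))
}

X2 <- rbind(c(0, 0), c(1, 0), c(0, 1))                       # p = 2, n = 3
X3 <- rbind(c(0, 0, 0), c(1, 0, 0), c(0, 1, 0), c(1, 1, 1))  # p = 3, n = 4
shatters(X2)  # TRUE
shatters(X3)  # TRUE
```

The points in `X3` are the corners of the unit square with a third coordinate added, which is the situation described above: four points that a line cannot always separate in two dimensions become separable under every labelling once a third predictor is available.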
The point of all of this is that if we keep $n$ fixed and increase $p$ we increase the number of patterns that we can separate, until we reach the point where we can perfectly classify any assignment of labels. With kernel SVM we implicitly fit a linear classifier in a high-dimensional space, which is why we very rarely have to worry about the existence of a separation.
For a set of possible classifiers $\mathscr F$, if for a sample of $n$ points there exist functions in $\mathscr F$ that can perfectly classify any assignment of labels to these $n$ points, we say that $\mathscr F$ can shatter $n$ points. If $\mathscr F$ is the set of all linear classifiers in $p$ variables then $\mathscr F$ can shatter up to $n=p+1$ points. If $\mathscr F$ is the space of all measurable functions of $p$ variables then it can shatter any number of points. This notion of shattering, which tells us about the complexity of a set of possible classifiers, comes from statistical learning theory and can be used to make statements about the amount of overfitting that a set of classifiers can do. If you're interested in it I highly recommend von Luxburg and Schölkopf, "Statistical Learning Theory: Models, Concepts, and Results" (2008).