Solved – Support Vector Machines (SVM) maximum margin hyperplane use

Tags: libsvm, svm

I have recently been learning about Support Vector Machines, and I need some clarification about the use of the maximum-margin hyperplane.

From a labeled training set of points, we train a model that should correctly classify future test points.
To do this, we search for the best hyperplane that divides the training points into two classes (True, False), using kernels and so on to achieve this.

But once we've obtained our best hyperplane, what do we do with it?

Do we just use it to classify new data points, by verifying which ones are "on this side" and which ones "on the other side"?

I've understood how to get the famous best hyperplane, but I need some clarification on what to do next with it…

Thanks!

Best Answer

There are two common ways of utilizing the maximum-margin hyperplane of a trained SVM.

(1) Prediction for new data points

Based on a given training dataset, an SVM hyperplane is fully specified by its normal (weight) vector $w$ and an intercept $b$. (These variable names derive from a tradition established in the neural-networks literature, where the two respective quantities are referred to as 'weight' and 'bias.') A new data point $x \in \mathbb{R}^d$ can then be classified as

\begin{align} f(x) = \textrm{sgn}\left(\langle w,x \rangle + b \right) \end{align}

where $\langle w,x \rangle$ represents the inner product. Thanks to the Karush-Kuhn-Tucker complementarity conditions, the discriminant function can be rewritten as

\begin{align} f(x) = \textrm{sgn}\left( \sum_{i \in SV} y_i \alpha_i \langle x_i, x \rangle + b \right), \end{align}

where the hyperplane is implicitly encoded by the support vectors $x_i$, where $y_i \in \{-1, +1\}$ are their class labels, and where $\alpha_i$ are the support vector coefficients. The support vectors are those training data points which lie closest to the separating hyperplane. Thus, predictions can be made very efficiently, since only inner products (or, more generally, kernel functions) between the support vectors and the test point have to be evaluated.
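To make this concrete, here is a minimal sketch in Python using scikit-learn (whose SVC class wraps libsvm); the toy dataset and variable names are invented purely for illustration. Note that scikit-learn stores the products $y_i \alpha_i$ directly in dual_coef_, so the labels do not appear explicitly in the computation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset, invented for the example.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

clf = SVC(kernel="linear").fit(X_train, y_train)

# Dual-form decision value: sum over support vectors of y_i * alpha_i * <x_i, x> + b.
# scikit-learn stores y_i * alpha_i in dual_coef_ and b in intercept_.
K = clf.support_vectors_ @ X_test.T               # inner products <x_i, x>
decision_manual = clf.dual_coef_ @ K + clf.intercept_

# Matches the library's own decision values ...
assert np.allclose(decision_manual.ravel(), clf.decision_function(X_test))

# ... and the predicted class is simply the side of the hyperplane the point falls on.
pred_manual = np.where(decision_manual.ravel() > 0, clf.classes_[1], clf.classes_[0])
assert np.array_equal(pred_manual, clf.predict(X_test))
```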

Some have suggested also considering the distance between a new data point $x$ and the hyperplane, as an indicator of how confident the model was in its prediction. However, it is important to note that hyperplane distance itself does not afford inference; there is no probability associated with a new prediction, which is why an SVM is sometimes referred to as a point classifier. If probabilistic output is desired, other classifiers may be more appropriate, e.g., the SVM's probabilistic cousin, the relevance vector machine (RVM).
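If one does want such a heuristic confidence score for a linear-kernel SVM, the signed geometric distance is straightforward to compute from the fitted model. A minimal sketch, again assuming scikit-learn and an invented toy dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset, invented for the example.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Signed geometric distance to the hyperplane: (<w, x> + b) / ||w||.
# decision_function returns <w, x> + b; dividing by ||w|| converts it to a distance.
distance = clf.decision_function(X) / np.linalg.norm(clf.coef_)

# A larger |distance| is often read as "more confident", but it is only a
# heuristic score, not a calibrated probability.
```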

(2) Reconstructing feature weights

There is another way of putting an SVM model to use. In many classification analyses it is interesting to examine which features drove the classifier, i.e., which features played the biggest role in shaping the separating hyperplane. Given a trained SVM model with a linear kernel, these feature coefficients $w_1, \ldots, w_d$ can be reconstructed easily using

\begin{align} w = \sum_{i=1}^n y_i \alpha_i x_i \end{align}

where $x_i$ and $y_i$ represent the $i^\textrm{th}$ training example and its corresponding class label.
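In scikit-learn, for instance, this reconstruction reduces to a single matrix product, because dual_coef_ already holds the products $y_i \alpha_i$. A short sketch with an invented toy dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset, invented for the example.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# w = sum_i y_i * alpha_i * x_i; dual_coef_ stores y_i * alpha_i, so the sum
# is a single matrix product over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_

# For a linear kernel, the library exposes the same vector as coef_.
assert np.allclose(w, clf.coef_)
```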

An important caveat of this approach is that the resulting feature weights are simple numerical coefficients without inferential quality; there is no measure of confidence associated with them. Thus, we cannot readily argue that some features were 'more important' than others, and we cannot infer that a feature with a particularly low coefficient was 'not important' in the classification problem. In order to allow for inference on feature weights, we would need to resort to more general-purpose approaches, such as the bootstrap, a permutation test, or a feature-selection algorithm embedded in a cross-validation scheme.
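As a rough illustration of the permutation-test idea (a sketch under invented settings, not a prescription), one can refit the SVM on label-permuted data to obtain a null distribution for each feature weight:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy dataset, invented for the example.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

def feature_weights(X, y):
    """Feature weights of a linear-kernel SVM fitted to (X, y)."""
    return SVC(kernel="linear").fit(X, y).coef_.ravel()

rng = np.random.default_rng(0)
observed = np.abs(feature_weights(X, y))

# Null distribution: refit on label-permuted data many times.
n_perm = 1000
null = np.empty((n_perm, X.shape[1]))
for i in range(n_perm):
    null[i] = np.abs(feature_weights(X, rng.permutation(y)))

# Per-feature p-value: fraction of permuted weights at least as large as the
# observed one (+1 in numerator and denominator to avoid p = 0).
p_values = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)
print(p_values)
```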