Solved – One Class SVM strange decision boundary

Tags: outliers, svm

I am trying to plot the decision boundary of a One Class SVM.

This is a 2-dimensional representation of my training data:

[Figure: train data]

And here is the prediction obtained on the training data:

[Figure: prediction]

As you can see, the labels seem to be assigned randomly. In theory I should obtain a ball that contains most of the data.
Another problem is the absence of the decision boundary, which according to the code should be drawn in the second figure.

Am I doing something wrong?

Here is the code I used to generate the results.

import numpy as np
import pylab as pl
import sklearn as sk
import sklearn.preprocessing  # makes sk.preprocessing available
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

DG = DataGenerator()
X, y = DG.J2M()

# Use samples 0-149 and 250-499 for training
train_index = list(range(150)) + list(range(250, 500))

print('PCA')
# Center (without scaling) the training data, then project to 2 components
pca = PCA(n_components=2)
pca.fit(sk.preprocessing.scale(X[train_index, :], with_std=False))

x_pca = pca.transform(X)
x1 = x_pca[train_index, :]

pl.figure()
pl.scatter(x1[:, 0], x1[:, 1], c=y[train_index])
pl.title('train data')
pl.show()


print('Train SVM')
# Note: degree and coef0 are ignored by the RBF kernel; in this scikit-learn
# version gamma=0.0 means "use the default 1/n_features"
OCSVM = OneClassSVM(kernel='rbf', degree=3, gamma=0.0, coef0=0.0,
                    tol=0.001, nu=.6, shrinking=True, cache_size=200,
                    verbose=False, max_iter=-1, random_state=None)

OCSVM.fit(x1)

# Grid covering the training data with a margin of 20 units on each side
a1 = x1[:, 0].min() - 20
a2 = x1[:, 0].max() + 20
b1 = x1[:, 1].min() - 20
b2 = x1[:, 1].max() + 20

xx1, yy1 = np.meshgrid(np.linspace(a1, a2, 1000), np.linspace(b1, b2, 1000))

# Evaluate the decision function over the grid
Z1 = OCSVM.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
Z1 = Z1.reshape(xx1.shape)

y_est1 = OCSVM.predict(x1)
pl.figure()
pl.scatter(x1[:, 0], x1[:, 1], c=y_est1)
pl.contour(xx1, yy1, Z1, levels=[0], linewidths=2)
pl.title('prediction of x1')
pl.show()
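
A quick sanity check (added here for illustration, not part of my original script): if the decision function never changes sign inside the grid, pl.contour has nothing to draw at levels=[0], which would explain the missing boundary.

# Diagnostic sketch: inspect the range of the decision function on the grid
print('decision function range: [%.3f, %.3f]' % (Z1.min(), Z1.max()))
if Z1.min() > 0 or Z1.max() < 0:
    print('the level-0 contour lies entirely outside the plotted grid')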

EDIT:

I have tried including more data in my estimator and using different values of $\nu$ and $\gamma$:

print(x_pca.shape)  # (500, 2)

p = x_pca.shape[1]  # p was undefined in the snippet; presumably the number of features
OCSVM = OneClassSVM(kernel='rbf', degree=2, gamma=1./p, coef0=0.0,
                    tol=0.001, nu=0.1, shrinking=True, cache_size=200,
                    verbose=False, max_iter=-1, random_state=None)

OCSVM.fit(x_pca)
y_est = OCSVM.predict(x_pca)

pl.figure()
pl.scatter(x_pca[:, 0], x_pca[:, 1], c=y_est)
pl.show()

As can be seen in the figure, there is a clear separation between the faulty and the normal data. Despite that, I am not able to train a good estimator.

[Figure: prediction on x_pca]

Best Answer

The value of nu is too large. nu=0.6 means that you allow up to 60% of the training points to be outliers (at least asymptotically). You should bring it down to 1%, 0.1%, or whatever you expect the novelty rate to be. Also note that if you want to use gamma=0, you can simply remove it from the list of parameters: gamma=0.0 is equivalent to the default, gamma=1/n_features.
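
For illustration, a minimal sketch of this fix on synthetic data (the Gaussian blob, the outlier points, and nu=0.02 below are assumptions for the example, not values from the question):

import numpy as np
import pylab as pl
from sklearn.svm import OneClassSVM

# Illustrative 2-D data: one Gaussian blob plus a few scattered outliers
rng = np.random.RandomState(0)
X_train = np.r_[rng.randn(200, 2), rng.uniform(-6, 6, size=(4, 2))]

# nu set near the expected novelty rate (~2% here); gamma is omitted so the
# library default (1/n_features) is used
clf = OneClassSVM(kernel='rbf', nu=0.02)
clf.fit(X_train)

# Evaluate the decision function on a grid and draw the level-0 boundary
xx, yy = np.meshgrid(np.linspace(-8, 8, 500), np.linspace(-8, 8, 500))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

pl.figure()
pl.scatter(X_train[:, 0], X_train[:, 1], c=clf.predict(X_train))
pl.contour(xx, yy, Z, levels=[0], linewidths=2)
pl.title('One-Class SVM, nu=0.02')
pl.show()

With a small nu, the level-0 contour should enclose most of the blob, which is the "ball that contains most of the data" the question expects.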