Should the training samples all be positive examples or not?
Yes. A one-class SVM (like any other outlier detection algorithm) is trained on just one class. Whether that class is called positive or negative is a naming convention, but more often you will be looking for positive examples, which are underrepresented.
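As a sketch of what this looks like in practice (using scikit-learn's `OneClassSVM` on synthetic data; all parameter values here are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical data: the training set contains ONLY the "normal" class --
# no examples of the other class are needed at training time.
rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)
model.fit(X_train)

# At prediction time the model labels points as +1 (inlier) or -1 (outlier).
X_test = np.array([[0.0, 0.0],    # close to the training data
                   [8.0, 8.0]])   # far away from it
pred = model.predict(X_test)
print(pred)
```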
Which kernel function can get better result, linear kernel or RBF kernel?
"There is no free lunch". There is no general answer, the reason behind having many kernels (not just linear and rbf) is that they work well in different applications. It is data dependant decision, so you will have to test at least those two.
What is the effect of nu's values to the model?
It corresponds to bounds on the fraction of points becoming support vectors, so it limits the model's complexity (the smaller the number of SVs, the simpler the model: less prone to overfitting, but more prone to underfitting). As in the http://www.cms.livjm.ac.uk/library/archive/Grid%20Computing/NoveltyDetection/sch00support.pdf paper, it directly corresponds to:
- "an upper bound on the fraction of outliers"
- "a lower bound on the fraction of SVs".
As per the comments, I suspect an issue with scaling. Another possibility is a poor choice of hyperparameters; in the case of one-class SVM these are $\nu$ and the kernel parameters.
Scaling
When using SVMs it is appropriate to scale your data, which you have done. In scaling, however, it is important to use the same coefficients for both the training and the test set. This is explained in this practical guide to SVM classification (see 2.2, Scaling).
If you use different coefficients, your training and test sets become incompatible. The smaller your sets are, the larger this incompatibility may be (you are quite prone to this).
I am going to guess you scaled like this:
svm-scale -l -1 -u 1 train.txt
svm-scale -l -1 -u 1 test.txt
This is wrong! The scaling tool in LIBSVM internally computes coefficients based on the minimum and maximum per feature. Clearly these may differ between data sets (the larger the sets, the less likely the difference is to be substantial).
To ensure you use a single set of coefficients, use the following commands:
svm-scale -l -1 -u 1 -s coefs.txt train.txt
svm-scale -r coefs.txt test.txt
This saves the coefficients computed based on the training set and reuses them to scale the test set. This way they are compatible.
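The same principle applies outside LIBSVM. For instance, a sketch in scikit-learn terms, where a `MinMaxScaler` fitted on the training set plays the role of the saved coefficient file:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(100, 3))
# The test set deliberately contains a point outside the training range.
X_test = np.vstack([rng.uniform(0, 10, size=(20, 3)),
                    [[12.0, 12.0, 12.0]]])

# Fit the scaling coefficients on the TRAINING data only (like -s coefs.txt)...
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)

# ...and reuse exactly those coefficients for both sets (like -r coefs.txt).
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Training features land exactly in [-1, 1]; test features may fall outside,
# which is expected and harmless -- the two sets stay compatible.
print(X_train_s.min(), X_train_s.max(), X_test_s.max())
```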
Hyperparameters and choice of kernel
When using an SVM (any formulation), it is important to use optimal values of the hyperparameters. You used the sigmoid kernel (why?), which has the following kernel function:
$$\kappa(\mathbf{x}_i,\mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i^T\mathbf{x}_j + c_0)^{d}$$
This is quite a complex kernel function with 3 tuning parameters (that is a lot), and it is known to cause numerical issues. I suggest considering the RBF kernel instead, which has one tuning parameter and no numerical problems.
Since you have used one-class SVM with a sigmoid kernel, you have 4 parameters ($\nu$, $\gamma$, $c_0$ and $d$). Tuning all of these is going to be a hassle and will very likely cause overfitting, because your data sets are tiny. Yet another reason to get rid of the sigmoid kernel.
Best Answer
One-class SVMs are hard!
1) First, in general, one-class SVM is an unsupervised learning technique, so there is no correct answer, just as there is no correct number of clusters in k-means. As with k-means, there may be metrics that evaluate the quality of a solution, but they are all heuristic, and therefore there are many of them (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation lists some of the cluster metrics implemented in sklearn). Unfortunately, I do not know of any quality metric for one-class SVM; hopefully someone on CV can answer that.
Nor can you use cross-validation to select the hyperparameters, because again you have no correct solution against which to measure some form of accuracy.
So unfortunately you have to set some possible values of the nu hyperparameter and verify whether the solution "makes sense" for your problem. As far as I know there is no simpler heuristic.
Of course, if you have the correct class for the data, that is, you know which data are normal and which are not, then you can use cross-validation to select the hyperparameters. But in that case your problem is really a classification problem, and you are using a one-class SVM as a classifier. This is almost always a bad idea: if you have a classification problem, use a classification algorithm, not an unsupervised one!
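If you do have such labels, hyperparameter selection might look roughly like this sketch (scikit-learn, synthetic data; the parameter grid and the F1 criterion are illustrative choices, not a recommendation):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
# Hypothetical labelled data: normal points plus a few known anomalies.
X_normal = rng.normal(size=(300, 2))
X_anomaly = rng.uniform(-6, 6, size=(30, 2))
X_eval = np.vstack([X_normal, X_anomaly])
y_eval = np.array([1] * 300 + [-1] * 30)  # ground truth: 1 = normal, -1 = anomaly

best = None
for nu in (0.01, 0.05, 0.1):
    for gamma in (0.1, 0.5, 1.0):
        # Train on the normal data only, then score against the known labels.
        model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_normal)
        score = f1_score(y_eval, model.predict(X_eval), pos_label=-1, zero_division=0)
        if best is None or score > best[0]:
            best = (score, nu, gamma)
print("best F1=%.3f at nu=%s gamma=%s" % best)
```

A proper setup would evaluate on held-out data rather than the training points; this only illustrates the mechanics of label-driven selection.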
2) I do not think a linear-kernel one-class SVM makes a lot of sense. What a one-class SVM does is pick at most a fraction nu of your data to be considered non-normal (let us call it the positive class, with the rest the negative class) and then solve the usual SVM optimization to find the location of the separating hyperplane. With the linear kernel, the hyperplane will be a plane (an odd sentence!), so what you get at the end is that half of the space will be called negative (the normal data) and the other half positive.
Usually what one wants is a curved, closed surface that contains the negative (normal) data, with everything "outside" labelled positive. This can only be accomplished with a non-linear kernel; the RBF kernel will certainly do it (I don't know about the polynomial kernel).
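The difference is easy to demonstrate; a sketch probing both models far from the data (scikit-learn, synthetic data centered at the origin; parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))  # "normal" data around the origin

lin = OneClassSVM(kernel="linear", nu=0.1).fit(X)
rbf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)

# Probe two points far from the data, in opposite directions along the
# linear model's normal vector w.
w = lin.coef_[0]
far = np.vstack([100 * w / np.linalg.norm(w),
                 -100 * w / np.linalg.norm(w)])

print("linear:", lin.predict(far))  # half-space: one far point still "normal"
print("rbf:   ", rbf.predict(far))  # closed surface: both far points are outliers
```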
But using an RBF kernel makes your problem worse, because now you have two hyperparameters, nu and gamma, and no way to select them.
3) There is a technique called SVDD (support vector data description) that tries to create hyperspheres around your negative data (which can be deformed by using kernels). At least in the linear case you have only one hyperparameter, nu, but your negative space will be contained inside the sphere rather than being a half-space, as with the linear one-class SVM. I have never used SVDD, so I have no experience with it.
And that is why one-class SVMs are hard. Sorry, not much help there.