Should the training samples all be positive examples or not?
Yes. A one-class SVM (like any other outlier detection algorithm) is trained on just one class. Whether that class is called positive or negative is a naming convention, but more often you will be looking for positive examples, which are underrepresented.
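As a sketch of what this looks like in practice (using scikit-learn's `OneClassSVM` on synthetic data; all parameter values here are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical data: the training set contains ONLY the "normal" class --
# no examples of the other class are needed at training time.
rng = np.random.RandomState(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)
model.fit(X_train)

# At prediction time the model labels points as +1 (inlier) or -1 (outlier).
X_test = np.array([[0.0, 0.0],    # close to the training data
                   [8.0, 8.0]])   # far away from it
pred = model.predict(X_test)
print(pred)
```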
Which kernel function can get better result, linear kernel or RBF kernel?
"There is no free lunch". There is no general answer, the reason behind having many kernels (not just linear and rbf) is that they work well in different applications. It is data dependant decision, so you will have to test at least those two.
What is the effect of nu's values to the model?
It corresponds to bounds on the fraction of points becoming support vectors, so it limits the model's complexity (the smaller the number of SVs, the simpler the model: less prone to overfitting, but more prone to underfitting). As in the http://www.cms.livjm.ac.uk/library/archive/Grid%20Computing/NoveltyDetection/sch00support.pdf paper, it directly corresponds to:
- "an upper bound on the fraction of outliers"
- "a lower bound on the fraction of SVs".
As per the comments, I suspect an issue with scaling. Another possibility is a poor choice of hyperparameters; in the case of one-class SVM these are $\nu$ and the kernel parameters.
Scaling
When using SVMs it is appropriate to scale your data, which you have done. In scaling, however, it is important to use the same coefficients for both the training and the test set. This is explained in this practical guide to SVM classification (see 2.2, Scaling).
If you use different coefficients, your training and test sets become incompatible. The smaller your sets are, the larger this incompatibility may be (you are quite prone to this).
I am going to guess you scaled like this:
svm-scale -l -1 -u 1 train.txt
svm-scale -l -1 -u 1 test.txt
This is wrong! The scaling tool in LIBSVM internally computes coefficients based on the minimum and maximum per feature. Clearly these may differ between data sets (the larger the sets, the less likely the difference is to be substantial).
To ensure you use a single set of coefficients, use the following commands:
svm-scale -l -1 -u 1 -s coefs.txt train.txt
svm-scale -r coefs.txt test.txt
This saves the coefficients computed based on the training set and reuses them to scale the test set. This way they are compatible.
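The same principle applies outside LIBSVM. For instance, a sketch in scikit-learn terms, where a `MinMaxScaler` fitted on the training set plays the role of the saved coefficient file:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(100, 3))
# The test set deliberately contains a point outside the training range.
X_test = np.vstack([rng.uniform(0, 10, size=(20, 3)),
                    [[12.0, 12.0, 12.0]]])

# Fit the scaling coefficients on the TRAINING data only (like -s coefs.txt)...
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)

# ...and reuse exactly those coefficients for both sets (like -r coefs.txt).
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Training features land exactly in [-1, 1]; test features may fall outside,
# which is expected and harmless -- the two sets stay compatible.
print(X_train_s.min(), X_train_s.max(), X_test_s.max())
```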
Hyperparameters and choice of kernel
When using an SVM (any formulation), it is important to use optimal values of the hyperparameters. You used the sigmoid kernel (why?), which has the following kernel function:
$$\kappa(\mathbf{x}_i,\mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i^T\mathbf{x}_j + c_0)^{d}$$
This is quite a complex kernel function with 3 tuning parameters (that is a lot), and it is known to cause numerical issues. I suggest considering the RBF kernel instead, which has one tuning parameter and no numerical problems.
Since you have used one-class SVM with a sigmoid kernel, you have 4 parameters ($\nu$, $\gamma$, $c_0$ and $d$). Tuning all of these is going to be a hassle and will very likely cause overfitting, because your data sets are tiny. Yet another reason to get rid of the sigmoid kernel.
Best Answer
One-class SVMs are hard!
1) First, in general, one-class SVM is an unsupervised learning technique, so there is no correct answer, just as there is no correct number of clusters in k-means. As with k-means, there may be metrics that evaluate the quality of a solution, but they are all heuristic, and therefore there are many of them (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation lists some of the cluster metrics implemented in sklearn). Unfortunately, I do not know of any quality metric for one-class SVM; hopefully someone on CV can answer that.
Nor can you use cross-validation to select the hyperparameters, because again you have no correct solution against which to measure some form of accuracy.
So unfortunately you have to set some possible values of the nu hyperparameter and verify whether the solution "makes sense" for your problem. As far as I know there is no simpler heuristic.
Of course, if you have the correct class for the data, that is, you know which data are normal and which are not, then you can use cross-validation to select the hyperparameters. But in that case your problem is really a classification problem, and you are using a one-class SVM as a classifier. This is almost always a bad idea: if you have a classification problem, use a classification algorithm, not an unsupervised one!
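If you do have such labels, hyperparameter selection might look roughly like this sketch (scikit-learn, synthetic data; the parameter grid and the F1 criterion are illustrative choices, not a recommendation):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score

rng = np.random.RandomState(0)
# Hypothetical labelled data: normal points plus a few known anomalies.
X_normal = rng.normal(size=(300, 2))
X_anomaly = rng.uniform(-6, 6, size=(30, 2))
X_eval = np.vstack([X_normal, X_anomaly])
y_eval = np.array([1] * 300 + [-1] * 30)  # ground truth: 1 = normal, -1 = anomaly

best = None
for nu in (0.01, 0.05, 0.1):
    for gamma in (0.1, 0.5, 1.0):
        # Train on the normal data only, then score against the known labels.
        model = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_normal)
        score = f1_score(y_eval, model.predict(X_eval), pos_label=-1, zero_division=0)
        if best is None or score > best[0]:
            best = (score, nu, gamma)
print("best F1=%.3f at nu=%s gamma=%s" % best)
```

A proper setup would evaluate on held-out data rather than the training points; this only illustrates the mechanics of label-driven selection.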
2) I do not think a linear-kernel one-class SVM makes a lot of sense. What a one-class SVM does is pick at most a fraction nu of your data to be considered non-normal (let us call it the positive class, with the rest the negative class) and then solve the usual SVM optimization to find the location of the separating hyperplane. With the linear kernel, the hyperplane will be a plane (an odd sentence!), so what you get at the end is that half of the space will be called negative (the normal data) and the other half positive.
Usually what one wants is a curved, closed surface that contains the negative (normal) data, with everything "outside" labelled positive. This can only be accomplished with a non-linear kernel; the RBF kernel will certainly do it (I don't know about the polynomial kernel).
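The difference is easy to demonstrate; a sketch probing both models far from the data (scikit-learn, synthetic data centered at the origin; parameter values are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))  # "normal" data around the origin

lin = OneClassSVM(kernel="linear", nu=0.1).fit(X)
rbf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X)

# Probe two points far from the data, in opposite directions along the
# linear model's normal vector w.
w = lin.coef_[0]
far = np.vstack([100 * w / np.linalg.norm(w),
                 -100 * w / np.linalg.norm(w)])

print("linear:", lin.predict(far))  # half-space: one far point still "normal"
print("rbf:   ", rbf.predict(far))  # closed surface: both far points are outliers
```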
But using an RBF kernel makes your problem worse, because now you have two hyperparameters, nu and gamma, and no way to select them.
3) There is a technique called SVDD (support vector data description) that tries to create hyperspheres around your negative data (which can be deformed by using kernels). At least in the linear case you have only one hyperparameter, nu, but your negative space will be contained inside the sphere rather than being a half-space, as with the linear one-class SVM. I have never used SVDD, so I have no experience with it.
And that is why one-class SVMs are hard. Sorry, not much help there.