As highBandwidth suggests, it depends on whether you are using a linear SVM or a non-linear one (being pedantic, if a kernel is not used it is a maximal margin linear classifier rather than an SVM).
A maximal margin linear classifier is no different from any other linear classifier in that, if the data-generating process means there are interactions between the attributes, then providing those interaction terms is likely to improve performance. The maximal margin linear classifier is rather like ridge regression, with a slight difference in the penalty term that is designed to avoid over-fitting (given suitable values for the regularisation parameter), and in most cases ridge regression and the maximal margin classifier will give similar performance.
If you think that interaction terms are likely to be important, then you can introduce them into the feature space of an SVM by using the polynomial kernel $K(x,x') = (x\cdot x' + c)^d$, which gives a feature space in which each axis represents a monomial of order $d$ or less; the parameter $c$ controls the relative weighting of monomials of different orders. So an SVM with a polynomial kernel is equivalent to fitting a polynomial model in the attribute space, which implicitly incorporates those interactions.
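To make that concrete, here is a rough scikit-learn sketch (the data set and all parameter values are placeholders of my own, nothing canonical) comparing a polynomial-kernel SVM with a linear SVM trained on explicitly expanded monomial features; the two feature spaces differ only in the relative weighting of the monomials, which is what $c$ (`coef0` in scikit-learn) adjusts:

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Placeholder data set, purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Non-linear SVM with the polynomial kernel K(x, x') = (x.x' + c)^d
poly_svm = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)

# Roughly equivalent: expand the attributes into all monomials of order <= 2
# (which include the pairwise interaction terms) and fit a linear SVM there.
X_expanded = PolynomialFeatures(degree=2, include_bias=True).fit_transform(X)
linear_svm = SVC(kernel="linear", C=1.0).fit(X_expanded, y)

print(poly_svm.score(X, y), linear_svm.score(X_expanded, y))
```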
Given enough features, any linear classifier can trivially fit the data. IIRC any $n$ points in "general position" in an $(n-1)$-dimensional space can be shattered (separated in any arbitrary manner) by a hyper-plane (c.f. VC dimension). Doing this will generally result in severe over-fitting, and so should be avoided. The point of maximal margin classification is to limit this over-fitting by adding a penalty term which means that the largest possible separation is achieved (so that the greatest deviation from any training example would be required to produce a misclassification). This means you can transform the data into a very high dimensional space (where a linear model is very powerful) without incurring too much over-fitting.
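As a purely synthetic illustration of that shattering result (the numbers are arbitrary), a linear SVM with a very large $C$ (approximating a hard margin) will fit any labelling of $n$ points in general position in an $(n-1)$-dimensional space:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 20
X = rng.randn(n, n - 1)                        # 20 points in 19 dimensions
y = np.array([0] * (n // 2) + [1] * (n // 2))
rng.shuffle(y)                                 # a completely arbitrary labelling

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # very large C ~ hard margin
print(clf.score(X, y))                         # 1.0 -- the training set is fitted perfectly
```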
Note that some kernels give rise to an infinite dimensional feature space, where a "trivial" classification is guaranteed to be possible for any finite training sample in general position. For example, the radial basis function kernel, $K(x,x') = \exp(-\gamma\|x - x'\|^2)$, has a feature space that is the positive orthant of an infinite dimensional hypersphere. Such kernels make the SVM a universal approximator that can represent essentially any decision boundary.
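You can check the hypersphere claim numerically; this little sketch (arbitrary data and an arbitrary $\gamma$) just verifies that every image has unit norm and that all pairwise inner products are positive:

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(5, 3)
gamma = 0.5                                   # arbitrary value for illustration

# Gram matrix of the RBF kernel, K(x, x') = exp(-gamma * ||x - x'||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

print(np.allclose(np.diag(K), 1.0))   # True: K(x, x) = 1, so unit norm in feature space
print(np.all(K > 0))                  # True: all inner products positive (positive orthant)
```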
However this is only part of the story. In practice, we generally use a soft-margin SVM, where the margin constraint is allowed to be violated, and there is a regularisation parameter that controls the trade-off between maximising the margin (a penalty term, similar to that used in ridge regression) and the magnitude of the slack variables (which is akin to the loss on the training sample). We then avoid over-fitting by tuning the regularisation parameter, for example by minimising the cross-validation error (or some bound on the leave-one-out error), just as we would do in the case of ridge regression.
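A minimal sketch of that tuning step using scikit-learn's grid search (the data set and the grid itself are just placeholders; in practice you would adapt both to your problem):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: few observations, many attributes.
X, y = make_classification(n_samples=120, n_features=1000, n_informative=10,
                           random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100],
              "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}

# Choose the regularisation and kernel parameters by minimising the
# cross-validation error, as described above.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```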
So while the SVM can trivially classify the training set, it will generally only do so if the regularisation and kernel parameters are badly chosen. The key to achieving good results with any kernel model lies in choosing an appropriate kernel, and then in tuning the kernel and regularisation parameters to avoid over- or under-fitting the data.
In practice, the reason that SVMs tend to be resistant to over-fitting, even in cases where the number of attributes is greater than the number of observations, is that they use regularisation. The key to avoiding over-fitting lies in careful tuning of the regularisation parameter, $C$, and in the case of non-linear SVMs, careful choice of kernel and tuning of the kernel parameters.
The SVM is an approximate implementation of a bound on the generalisation error that depends on the margin (essentially the distance from the decision boundary to the nearest pattern from each class), but is independent of the dimensionality of the feature space (which is why using the kernel trick to map the data into a very high dimensional space isn't such a bad idea as it might seem). So in principle SVMs should be highly resistant to over-fitting, but in practice this depends on the careful choice of $C$ and the kernel parameters. Sadly, over-fitting can also occur quite easily when tuning the hyper-parameters, which is my main research area; see
G. C. Cawley and N. L. C. Talbot, "Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters", Journal of Machine Learning Research, vol. 8, pp. 841-861, April 2007.
and
G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
Both of those papers use kernel ridge regression, rather than the SVM, but the same problem arises just as easily with SVMs (also similar bounds apply to KRR, so there isn't that much to choose between them in practice). So in a way, SVMs don't really solve the problem of over-fitting, they just shift the problem from model fitting to model selection.
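One standard guard against that model-selection over-fitting (not the Bayesian regularisation of the hyper-parameters from the 2007 paper, just plain nested cross-validation) looks something like the following sketch, again with made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Placeholder data: few observations, many attributes.
X, y = make_classification(n_samples=120, n_features=1000, n_informative=10,
                           random_state=0)

# Inner loop: tune C and gamma by cross-validation.
inner = GridSearchCV(SVC(kernel="rbf"),
                     {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2]},
                     cv=5)

# Outer loop: estimate the performance of the whole procedure, including the
# hyper-parameter tuning, so the selection step cannot bias the estimate.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean(), scores.std())
```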
It is often tempting to make life a bit easier for the SVM by performing some sort of feature selection first. This generally makes matters worse, as unlike the SVM, feature selection algorithms tend to exhibit more over-fitting as the number of attributes increases. Unless you want to know which attributes are informative, it is usually better to skip the feature selection step and just use regularisation to avoid over-fitting the data.
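A related pitfall is performing the feature selection on the full data set before cross-validating; on pure noise (the made-up data below) that gives wildly optimistic estimates, whereas doing the selection inside each training fold does not:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Pure noise: no attribute carries any information about the labels.
rng = np.random.RandomState(0)
X = rng.randn(120, 5000)
y = rng.randint(0, 2, 120)

# Wrong: select the "best" attributes on the full data set, then cross-validate.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
print(cross_val_score(SVC(), X_sel, y, cv=5).mean())   # optimistically high

# Better: perform the selection inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
print(cross_val_score(pipe, X, y, cv=5).mean())        # close to chance (~0.5)
```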
In short, there is no inherent problem with using an SVM (or other regularised model such as ridge regression, LARS, Lasso, elastic net etc.) on a problem with 120 observations and thousands of attributes, provided the regularisation parameters are tuned properly.
Best Answer
Well, the way I see it, you've got a one-class SVM problem or, more broadly, an open-set classification problem.
It's implemented in scikit-learn [1] [2]. Maybe taking a look at the formulation of the open-set classification problem and the related machine would help.
[1] http://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection
[2] http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#example-svm-plot-oneclass-py
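If it helps, a minimal one-class SVM sketch along the lines of the scikit-learn example in [2] (all parameter values here are illustrative, not recommendations):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(100, 2)                          # "normal" observations only
X_test = np.r_[0.3 * rng.randn(20, 2),                     # more normal points ...
               rng.uniform(low=-4, high=4, size=(20, 2))]  # ... plus some outliers

clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(X_train)
pred = clf.predict(X_test)                                 # +1 = inlier, -1 = outlier
print(pred)
```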