Solved – How do support vector machines avoid overfitting

svm

I understand that in the dual form of the model for support vector machines, the feature vectors appear only through dot products. Mapping the feature vectors to a higher-dimensional space can accommodate classes that are not linearly separable in the original feature space, but explicitly computing this mapping and working with the higher-dimensional feature vectors can be computationally prohibitive. Instead, a kernel can be used to efficiently compute the same value as the dot product of the mapped vectors.
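To make that last point concrete, here is a minimal sketch of my own (not part of the original question): for the homogeneous degree-2 polynomial kernel, K(x, z) = (x·z)² equals the dot product of the explicitly mapped vectors φ(x)·φ(z), so the kernel gives the higher-dimensional dot product without ever constructing the mapped vectors.

```python
# Sketch: the kernel trick for the homogeneous degree-2 polynomial kernel.
# phi maps a 2-D point (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2), and
# (x . z)^2 computed in the original space equals phi(x) . phi(z).
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D input vector."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # dot product in the mapped (3-D) space
kernel = (x @ z) ** 2        # same value, computed in the original 2-D space

print(explicit, kernel)      # both print 1.0
```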

How do support vector machines avoid overfitting? Is maximizing the margin of the decision boundary the only trick that they use, or am I missing something?

Best Answer

Maximising the margin is not the only trick (although it is very important). If a non-linear kernel function is used, then the smoothness of the kernel function also has an effect on the complexity of the classifier and hence on the risk of over-fitting. If you use a Radial Basis Function (RBF) kernel, for example, and set the scale factor (kernel parameter) to a very small value, the SVM will tend towards a linear classifier. If you use a high value, the output of the classifier will be very sensitive to small changes in the input, which means that even with margin maximisation, you are likely to get over-fitting.
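As an illustration (a sketch of my own, not part of the original answer; the dataset and parameter values are purely illustrative), sweeping the RBF kernel parameter gamma in scikit-learn's SVC shows the transition from a smooth, near-linear boundary to one that memorises the training data, visible as a growing gap between training and test accuracy:

```python
# Sketch: effect of the RBF kernel parameter gamma on over-fitting.
# Small gamma -> smooth, nearly linear boundary; large gamma -> very
# sensitive to small input changes, so training accuracy is high but
# test accuracy drops.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:7}: train acc={clf.score(X_tr, y_tr):.2f}, "
          f"test acc={clf.score(X_te, y_te):.2f}")
# Typically the largest gamma shows near-perfect training accuracy
# together with noticeably worse test accuracy.
```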

Unfortunately, the performance of the SVM can be quite sensitive to the selection of the regularisation and kernel parameters, and it is possible to over-fit when tuning these hyper-parameters via e.g. cross-validation. The theory underpinning SVMs does nothing to prevent this form of over-fitting in model selection. See my paper on this topic:

G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
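One way to guard against this form of over-fitting (a sketch of my own using scikit-learn's GridSearchCV and cross_val_score, not code from the paper) is nested cross-validation: the hyper-parameter search stays inside the inner loop, so the outer-loop estimate of performance is not inflated by the tuning:

```python
# Sketch: nested cross-validation for an RBF SVM.
# The inner GridSearchCV tunes C and gamma; the outer cross_val_score
# evaluates the whole tuning procedure on held-out folds.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1e-3, 1e-2, 1e-1]}
inner = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid, cv=5)                      # inner loop: hyper-parameter tuning

scores = cross_val_score(inner, X, y, cv=5)  # outer loop: unbiased estimate
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```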
