Machine Learning – SVM, Variable Interaction and Training Data Fit

Tags: interaction, machine-learning, predictive-models, svm

I have two general/more theoretical questions.

1) I'm curious how SVMs handle variable interactions when building predictive models. E.g., if I have two features f1 and f2 and the target depends on f1, f2, and say f1 * f2 (or some function h(f1, f2)), does the SVM fit (not just out-of-sample but even on training data) improve when including f1, f2, and h(f1, f2) in the features over just including f1 and f2? Does the SVM algorithm deal with feature interactions? It seems like it would, given how the SVM tries to create hyperplanes in higher-dimensional space, but I'm not sure, so I wanted to ask.

2) When fitting an SVM on training data, given enough features and the optimal parameters (found via brute-force search or whatever), will an SVM always trivially fit the training data? Not sure if I worded that right, but basically: if there is enough variance/noise in the features, will an SVM always fit the training data 100%? Conversely, if the SVM does not fit the training data 100%, does that mean there is some information (or other features) affecting the target variable that is not captured in the data?

Thanks

Small clarification: I'm referring to kernel SVMs specifically.

Best Answer

As highBandwidth suggests, it depends on whether you are using a linear SVM or a non-linear one (to be pedantic, if a kernel is not used it is a maximal margin linear classifier rather than an SVM).

A maximal margin linear classifier is no different from any other linear classifier: if the data-generating process means that there are interactions between the attributes, then providing those interaction terms is likely to improve performance. The maximal margin linear classifier is rather like ridge regression, with a slight difference in the penalty term designed to avoid over-fitting (given suitable values for the regularisation parameter), and in most cases ridge regression and the maximal margin classifier will give similar performance.
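To make that concrete, here is a minimal sketch (a toy example of my own, assuming scikit-learn, not something from the original discussion) in which the target depends only on the interaction h(f1, f2) = f1 * f2. A linear SVM on f1 and f2 alone cannot fit even the training data well, while adding the product as an explicit third feature makes it separable.

```python
# Hypothetical illustration: a target driven purely by the interaction f1 * f2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))            # features f1 and f2
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # target driven by h(f1, f2) = f1 * f2

raw = SVC(kernel="linear", C=1.0).fit(X, y)      # f1 and f2 only
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])  # append the interaction term
aug = SVC(kernel="linear", C=1.0).fit(X_aug, y)

print("training accuracy, f1 and f2 only:  ", raw.score(X, y))       # typically near chance
print("training accuracy, with f1*f2 added:", aug.score(X_aug, y))   # typically close to 1.0
```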

If you think that interaction terms are likely to be important, then you can introduce them into the feature space of an SVM by using the polynomial kernel $K(x,x') = (x\cdot x' + c)^d$, which gives a feature space in which each axis represents a monomial of order $d$ or less; the parameter $c$ controls the relative weighting of monomials of different orders. So an SVM with a polynomial kernel is equivalent to fitting a polynomial model in the attribute space, which implicitly incorporates those interactions.
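A hedged illustration of the same point, again assuming scikit-learn and using a toy data set of my own choosing: a degree-2 polynomial kernel, $K(x,x') = (x\cdot x' + 1)^2$, captures the f1 * f2 interaction implicitly, without adding it by hand.

```python
# Hypothetical illustration: the degree-2 polynomial kernel's feature space
# contains the monomial f1*f2, so the interaction is modelled implicitly.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # target depends only on the interaction

linear = SVC(kernel="linear", C=1.0).fit(X, y)
poly = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)  # coef0 plays the role of c

print("linear kernel, training accuracy:    ", linear.score(X, y))
print("polynomial kernel, training accuracy:", poly.score(X, y))
```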

Given enough features, any linear classifier can trivially fit the data. IIRC, $n$ points in "general position" in an $(n-1)$-dimensional space can be shattered (separated in any arbitrary manner) by a hyperplane (c.f. VC dimension). Doing this will generally result in severe over-fitting, and so should be avoided. The point of maximal margin classification is to limit this over-fitting by adding a penalty term so that the largest possible separation is achieved (which would require the greatest deviation from any training example to produce a misclassification). This means you can transform the data into a very high-dimensional space (where a linear model is very powerful) without incurring too much over-fitting.
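As a sketch of the shattering argument (the random data set is my own construction, not from the answer): with $n$ points and $n-1$ pure-noise features, a hard-margin linear SVM separates completely arbitrary labels on the training data.

```python
# Hypothetical illustration: n points in general position in n-1 dimensions
# can be separated by a hyperplane regardless of how the labels are assigned.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 50
X_noise = rng.normal(size=(n, n - 1))     # n points in general position in n-1 dimensions
y_random = rng.integers(0, 2, size=n)     # labels assigned completely at random

clf = SVC(kernel="linear", C=1e6).fit(X_noise, y_random)   # very large C approximates a hard margin
print("training accuracy on random labels:", clf.score(X_noise, y_random))  # typically 1.0
```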

Note that some kernels give rise to an infinite-dimensional feature space, where a "trivial" classification is guaranteed to be possible for any finite training sample in general position. An example is the radial basis function kernel, $K(x,x') = \exp(-\gamma\|x - x'\|^2)$, whose feature space is the positive orthant of an infinite-dimensional hypersphere. Such kernels make the SVM a universal approximator that can represent essentially any decision boundary.
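The following sketch (parameter values chosen by me purely to make the effect visible) shows an RBF-kernel SVM with a very large $\gamma$ fitting pure-noise labels perfectly on the training sample, which is exactly the kind of "trivial" fit described above.

```python
# Hypothetical illustration: with a very narrow RBF kernel each training point
# effectively gets its own "bump", so even random labels are fitted perfectly.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))
y = rng.integers(0, 2, size=100)           # pure-noise target

rbf = SVC(kernel="rbf", gamma=1000.0, C=1e6).fit(X, y)
print("training accuracy on noise:", rbf.score(X, y))   # typically 1.0 -- a trivial fit
```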

However, this is only part of the story. In practice, we generally use a soft-margin SVM, where the margin constraint is allowed to be violated, and there is a regularisation parameter that controls the trade-off between maximising the margin (which is a penalty term, similar to that used in ridge regression) and the magnitude of the slack variables (which is akin to the loss on the training sample). We then avoid over-fitting by tuning the regularisation parameter, for example by minimising the cross-validation error (or some bound on the leave-one-out error), just as we would do in the case of ridge regression.
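In scikit-learn terms, that tuning step might look like the following sketch (the parameter grid and toy data are placeholders of my own choosing), where $C$ and $\gamma$ are selected by minimising cross-validation error rather than training error.

```python
# Hypothetical illustration: choosing the regularisation and kernel parameters
# of a soft-margin RBF SVM by cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)  # noisy interaction target

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```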

So while the SVM can trivially classify the training set, it will generally only do so if the regularisation and kernel parameters are badly chosen. The key to achieving good results with any kernel model lies in choosing an appropriate kernel, and then in tuning the kernel and regularisation parameters to avoid over- or under-fitting the data.