Solved – SVM: non-linear versus linear models

Tags: linear model, non-linear, radial basis, svm

In the context of classification on somewhat large datasets (say at least 50K x 50K), I am wondering in which cases non-linear models are superior to linear ones, enough to warrant the added complexity. In my own research I often see that, on these larger datasets, non-linear models cannot outperform linear ones (e.g., an RBF-kernel SVM versus a linear-kernel SVM). But this might be biased by my 'repository selection' of datasets, which are all sparse and drawn from transactional data.

My intuition says that, specifically for an RBF kernel, the linear kernel should be a lower bound on the performance you can achieve, since an RBF kernel with suitably chosen hyperparameters can mimic a linear one. However, my hope of exceeding that lower bound is not fulfilled, because in the end both achieve more or less the same performance.
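A minimal sketch of how one might check that intuition, assuming scikit-learn; the synthetic dataset, the parameter grids, and the specific values are arbitrary illustrations, not part of the original question:

```python
# Sketch (assumes scikit-learn): does a tuned RBF kernel at least match
# a linear kernel on a high-dimensional classification problem?
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=3)
rbf = GridSearchCV(SVC(kernel="rbf"),
                   {"C": [0.1, 1, 10],
                    "gamma": [1e-4, 1e-3, 1e-2, "scale"]},
                   cv=3)

linear.fit(X_tr, y_tr)
rbf.fit(X_tr, y_tr)

# With gamma allowed to go small, the RBF model should not do worse than
# the linear one; on data like this the gap is often negligible.
print("linear:", linear.score(X_te, y_te))
print("rbf:   ", rbf.score(X_te, y_te))
```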

Specifically my question is this: have you encountered situations in which non-linear models were worth the effort? Or do you perhaps know about research confirming/rejecting my own observations?

Best Answer

In the case of high-dimensional problems, linear SVMs tend to perform very well, as in text classification (see, for example, the classic paper Text Categorization with Support Vector Machines: Learning with Many Relevant Features). That paper shows how, for a high-dimensional, sparse problem with few irrelevant features, linear SVMs achieve excellent performance.
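A small sketch of that setting, assuming scikit-learn and access to the 20 newsgroups download; the two categories are an arbitrary illustration:

```python
# Sketch (assumes scikit-learn): a linear SVM on high-dimensional,
# sparse TF-IDF features, as in text classification.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cats = ["sci.space", "rec.autos"]  # two arbitrary categories
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

# TF-IDF yields tens of thousands of sparse features; LinearSVC works
# on this sparse representation directly and trains quickly.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))
```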

Also, Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition shows how the higher the dimensionality, the more likely it is to find a separating hyperplane.

Non-linear kernel machines tend to dominate when the number of dimensions is smaller. In general, a non-linear SVM will achieve at least as good performance as a linear one, but in the circumstances referred to above the difference may not be significant, and linear SVMs are much faster to train.
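To illustrate the low-dimensional case, here is a sketch (again assuming scikit-learn) on a two-feature dataset with a non-linear decision boundary; the dataset choice and sizes are illustrative only:

```python
# Sketch (assumes scikit-learn): on a low-dimensional, non-linearly
# separable problem, the RBF kernel clearly beats a linear SVM,
# while the linear SVM remains faster to train.
from time import perf_counter
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=5000, noise=0.25, random_state=0)  # 2 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("linear", LinearSVC()), ("rbf", SVC(kernel="rbf"))]:
    t0 = perf_counter()
    clf.fit(X_tr, y_tr)
    print(f"{name}: accuracy={clf.score(X_te, y_te):.3f}, "
          f"train time={perf_counter() - t0:.2f}s")
```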

Another interesting point to consider is correlation: both linear and non-linear SVMs are affected by highly correlated features (see this answer).
