Kernel Methods – Limitations and Use Cases

Tags: kernel trick, machine learning

Kernel methods are very effective in many supervised classification tasks. So what are the limitations of kernel methods, and when should one use them? Especially in the era of large-scale data, what are the advantages of kernel methods? What is the difference between kernel methods and multiple instance learning?
If the data is 500x10000, where 500 is the number of samples and 10000 is the dimension of each feature vector, can we use kernel methods in this circumstance?

Best Answer

Kernel methods can be used for supervised and unsupervised problems. Well-known examples are the support vector machine and kernel spectral clustering, respectively.

Kernel methods provide a structured way to use a linear algorithm in a transformed feature space, where the transformation is typically nonlinear (and maps to a higher-dimensional space). The key advantage of this so-called kernel trick is that nonlinear patterns can be found at a reasonable computational cost.
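
To make this concrete, here is a minimal sketch (assuming scikit-learn; the dataset and parameters are arbitrary illustrative choices): a linear SVM cannot separate concentric circles, while an RBF-kernel SVM can, because the kernel implicitly operates in a higher-dimensional feature space.

```python
# Minimal sketch of the kernel trick (assumes scikit-learn is installed).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the 2-D input space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf").fit(X_tr, y_tr)  # nonlinear boundary via the RBF kernel

print("linear kernel accuracy:", linear_svm.score(X_te, y_te))  # close to chance
print("RBF kernel accuracy:   ", rbf_svm.score(X_te, y_te))     # close to 1.0
```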

Note that I said the computational cost is reasonable, but not negligible. Kernel methods typically construct a kernel matrix $\mathbf{K} \in \mathbb{R}^{N\times N}$ with $N$ the number of training instances. The complexity of kernel methods is therefore a function of the number of training instances, rather than the number of input dimensions. Support vector machines, for example, have a training complexity between $O(N^2)$ and $O(N^3)$. For problems with very large $N$, this complexity is currently prohibitive.
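
A quick way to see this scaling is to build the kernel matrix explicitly (a NumPy-only sketch; the Gaussian kernel and the sizes are illustrative): its memory footprint grows with $N^2$ and is independent of the input dimension $d$.

```python
# Sketch: the kernel matrix is N x N, regardless of the input dimension d.
import numpy as np

N, d = 2000, 10000                       # illustrative sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))

# Gaussian (RBF) kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
gamma = 1.0 / d
sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
K = np.exp(-gamma * np.clip(sq_dists, 0.0, None))

print(K.shape)                           # (2000, 2000): grows with N, not with d
print(f"{K.nbytes / 1e6:.0f} MB")        # ~32 MB here; for N = 1,000,000 it would be ~8 TB
```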

This makes kernel methods very interesting from a computational perspective when the number of dimensions is large and the number of samples is relatively low (say, less than 1 million).

Related: Linear kernel and non-linear kernel for support vector machine?

SVM for Large Scale Problems

For very high dimensional problems, such as the 10000 dimensions you mention in the question, there is often no need to map to a higher dimensional feature space. The input space is already good enough. For such problems, linear methods are orders of magnitude faster with almost the same predictive performance. Examples of these methods can be found in LIBLINEAR or Vowpal Wabbit.
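
As an illustration, scikit-learn's LinearSVC uses LIBLINEAR under the hood; here is a hedged sketch on a sparse, high-dimensional text problem (the dataset choice is arbitrary and fetching it requires a download):

```python
# Sketch: a linear SVM (LIBLINEAR backend) on sparse, high-dimensional text data.
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# ~130,000 tf-idf features per document: the input space is already very high dimensional.
data = fetch_20newsgroups_vectorized(subset="train")
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

clf = LinearSVC(C=1.0).fit(X_tr, y_tr)   # trains in seconds despite ~130k dimensions
print("test accuracy:", clf.score(X_te, y_te))
```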

Linear methods are particularly interesting when you have many samples in a high dimensional input space. When you have only $500$ samples, using a nonlinear kernel method will also be cheap (since $N$ is small). If you had, say, $5{,}000{,}000$ samples in $10{,}000$ dimensions, kernel methods would be infeasible.
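
A quick sanity check for the $500 \times 10000$ case from the question (synthetic data, purely illustrative): the kernel matrix is only $500 \times 500$, so training a nonlinear SVM stays cheap.

```python
# Sketch: with N = 500 the kernel matrix is only 500 x 500, so a nonlinear SVM stays cheap.
import time
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N, d = 500, 10000
X = rng.standard_normal((N, d))
y = (X[:, 0] > 0).astype(int)            # arbitrary synthetic labels

start = time.perf_counter()
SVC(kernel="rbf", gamma="scale").fit(X, y)
print(f"RBF SVM on {N} x {d} data trained in {time.perf_counter() - start:.2f} s")
```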

For low-dimensional problems with many training instances (so-called large $N$ small $p$ problems), linear methods may yield poor predictive accuracy. For such problems, ensemble methods such as EnsembleSVM provide nonlinear decision boundaries at significantly reduced computational cost compared to standard SVM.
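
EnsembleSVM itself is a standalone C++ library; as a rough analogue (not the library's exact algorithm), the same idea can be sketched with scikit-learn's BaggingClassifier, training many SVMs on small subsamples so that each base model only ever sees a small kernel matrix:

```python
# Rough analogue of the ensemble idea: many small-subsample SVMs instead of one big one.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Large N, small p: many samples, few input dimensions.
X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 20 RBF SVMs, each fit on a 2% subsample (~1,500 points); each base model only sees
# a small kernel matrix, yet the combined decision boundary is nonlinear.
ensemble = BaggingClassifier(SVC(kernel="rbf"), n_estimators=20,
                             max_samples=0.02, n_jobs=-1, random_state=0)
ensemble.fit(X_tr, y_tr)
print("test accuracy:", ensemble.score(X_te, y_te))
```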
