Solved – Do kernel methods "scale" with the amount of data?

kernel-trick, svm

I've been reading about kernel methods, where you map the original $N$ data points to a feature space, compute the kernel (Gram) matrix, and plug that matrix into a standard linear algorithm. This all sounds good when the feature space is infinite-dimensional or otherwise very high-dimensional (much, much larger than $N$), but the kernel matrix itself is also pretty large at $N \times N$, meaning that if you double the number of points you quadruple the amount of memory required. Does this mean kernel methods do not scale well to larger data sets? Or is it not necessary to compute the entire kernel matrix and hold the whole thing in memory for most algorithms?
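For concreteness, here is a small sketch of that memory growth (using scikit-learn's `rbf_kernel` purely as an example kernel; the numbers are just illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Toy data: N points in d dimensions.
N, d = 10_000, 50
X = np.random.default_rng(0).standard_normal((N, d))

# Full Gram matrix: N x N entries, regardless of the feature-space dimension.
K = rbf_kernel(X)                      # shape (N, N)
print(K.shape, K.nbytes / 1e9, "GB")   # 10k x 10k float64 ~ 0.8 GB

# Doubling N quadruples the memory: 20k x 20k float64 ~ 3.2 GB.
```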

Best Answer

It's not necessary to hold the whole kernel matrix in memory at all times, but of course you pay the price of recomputing entries if you don't. Kernel methods are very efficient at dealing with high input dimensionality thanks to the kernel trick, but as you correctly note they don't scale up that easily to large numbers of training instances.
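A minimal sketch of that trade-off, assuming an RBF kernel and scikit-learn's `rbf_kernel` (the blocked helper below is a hypothetical illustration, not any particular library's API): compute one block of kernel rows at a time, use it, and throw it away.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_matvec(X, v, block_size=1000, gamma=None):
    """Compute K @ v without ever storing the full N x N kernel matrix.

    Rows of K are (re)computed one block at a time and discarded, trading
    extra kernel evaluations for O(N * block_size) memory instead of O(N^2).
    """
    N = X.shape[0]
    out = np.empty(N)
    for start in range(0, N, block_size):
        stop = min(start + block_size, N)
        K_block = rbf_kernel(X[start:stop], X, gamma=gamma)  # (block, N)
        out[start:stop] = K_block @ v
    return out

# Example: matrix-vector product with a 50k x 50k *implicit* kernel matrix,
# which would need roughly 20 GB if materialized in float64.
X = np.random.default_rng(0).standard_normal((50_000, 20))
v = np.ones(X.shape[0])
Kv = kernel_matvec(X, v)
```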

Nonlinear SVM, for example, has $\Omega(n^2)$ training complexity (where $n$ is the number of instances). This is no problem for data sets up to a few million instances, but after that it is no longer feasible. At that point, approximations can be used, such as fixed-size kernel approximations or ensembles of smaller SVM base models, as in the sketch below.
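As one concrete flavour of such an approximation, here is a sketch using scikit-learn's `Nystroem` transformer followed by a linear SVM (the data set and hyperparameters are made up for illustration): the data are mapped onto an approximate feature space built from $m \ll n$ landmark points, so memory and training cost grow with $n \cdot m$ rather than $n^2$.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic data standing in for a large training set.
X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# Fixed-size (Nystroem) kernel approximation with m = 300 landmarks,
# then an ordinary *linear* SVM trained on the approximate feature map.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=300, random_state=0),
    LinearSVC(),
)
model.fit(X, y)
print(model.score(X, y))
```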
