Projecting to lower/higher-dimensional space for classification: dimensionality reduction vs kernel trick

Tags: classification, dimensionality reduction, kernel trick, pca, svm

While learning about classification, I have seen two different arguments. One is that projecting the data to a lower-dimensional space, such as with PCA, makes the data more easily separable. The other is that projecting to a higher-dimensional space, such as with a kernel SVM, makes the separation easier.

Which is correct? Or does it depend on the data? What about projecting down with PCA, and afterwards projecting that space up with a kernel SVM (or vice versa)?

Best Answer

This answer also incorporates points from pAt84's answer.

Dimensionality reduction. I don't think reducing the dimension actually makes your data more separable, but it has other benefits. First, as the dimension of your space grows, you typically need more samples to catch patterns; this is simply a question of volume. Of course, this depends on how correlated your data are (adding a duplicate column changes nothing). So reducing the dimension can be a necessity when your feature vectors have too many components, e.g. if you work with text and use a TF-IDF representation, which creates one dimension per word (or lemma). It also protects you from the curse of dimensionality (e.g. if you need to run some sort of KNN afterwards). And it obviously speeds up whatever algorithm you run in the reduced space, though the cost of reducing the dimension can outweigh the gain of running algorithms there. One thing is sure: dimensionality reduction discards information. Most of the time it does so by discarding correlations in the input data.
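To make the TF-IDF example concrete, here is a minimal sketch with scikit-learn: the vectorizer creates one dimension per word, and PCA then compresses that space. The tiny corpus and the choice of 2 components are purely illustrative.

```python
# TF-IDF creates one dimension per term; PCA compresses that space
# before any downstream classifier is trained.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # shape: (4 documents, vocabulary size)
print("TF-IDF dimensions:", X.shape[1])  # one dimension per word

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X.toarray())  # plain PCA needs a dense array
print("Reduced dimensions:", X_reduced.shape[1])
print("Variance kept:", pca.explained_variance_ratio_.sum())
```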

Kernel trick. That being said, the kernel trick is an entirely different idea. Data that are not separable in the original space can become separable in a higher-dimensional one: imagine points sampled from two circles of radius $0.5$ and $1$; they are not linearly separable in the plane, but if you add the distance to $0$ as an extra coordinate, they become so. So the idea is that the dot product in the original space is not sufficient; we ought to use the dot product in that mapped space which includes the distance to the origin. That is the core idea of kernels: find functions that can "scatter" our data as we want into a Hilbert space. But the way I prefer to see it is that using a kernel means using prior information about your data: how they should be compared. Roughly, a kernel is more or less a similarity matrix, and a kernel-based method is a method where you've switched from the naive way of comparison (the regular dot product in the original space) to one that fits your data better (your kernel). Then, indeed, this corresponds mathematically to increasing the dimension of the space in which the comparison happens. Precisely: your data still live in the same-dimensional space, but the kernel you provided acts as if you had mapped your data into a higher-dimensional space (an RKHS), which can be infinite-dimensional (e.g. with an RBF kernel), and taken a dot product there. The main trick of kernel-based methods is that you never actually map your data into the RKHS explicitly; that is why it works at all. All the work is done by applying your kernel to pairs of samples. In the end, you haven't gained any information: the RKHS view gives a mathematical frame for explaining how it works, but intuitively you've simply provided a better way to compare your data. No dimensions were explicitly added to your data (computationally); you just found a more suitable comparison.
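Here is a minimal sketch of the circles example, using scikit-learn's `make_circles` (the radii and noise level are illustrative): a linear SVM fails in the plane, succeeds once the distance to the origin is appended as an explicit third coordinate, and an RBF-kernel SVM achieves the same without ever building that map.

```python
# Two concentric circles: not linearly separable in 2D, separable once the
# radius is added as a feature, and handled implicitly by an RBF kernel.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.5, noise=0.03, random_state=0)

# Linear SVM in the original 2D space: roughly chance-level accuracy.
print("linear, 2D:", SVC(kernel="linear").fit(X, y).score(X, y))

# Explicitly add the distance to the origin as a third coordinate.
X3 = np.hstack([X, np.linalg.norm(X, axis=1, keepdims=True)])
print("linear, 3D:", SVC(kernel="linear").fit(X3, y).score(X3, y))

# RBF kernel: same effect, but the higher-dimensional map stays implicit.
print("RBF,    2D:", SVC(kernel="rbf").fit(X, y).score(X, y))
```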

Now nothing prevents you from doing both: you could categorize texts by computing a TF-IDF representation, then applying PCA, then an RBF-SVM (a linear SVM is usually fine for text, but the combination would still make perfect sense), as sketched below.
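A minimal sketch of that combined pipeline, again with scikit-learn. TruncatedSVD stands in for PCA here because it works directly on sparse TF-IDF matrices; the tiny labelled corpus and all hyperparameters are illustrative only.

```python
# Project up (TF-IDF), project down (SVD/PCA-like), then compare
# implicitly in an RKHS (RBF-SVM).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

docs = [
    "the cat sat on the mat", "dogs and cats are pets",
    "my dog chases the neighbour's cat", "cats sleep most of the day",
    "stock markets fell sharply today", "investors fear market volatility",
    "the central bank raised interest rates", "bond yields climbed again",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = animals, 1 = finance

model = make_pipeline(
    TfidfVectorizer(),             # one dimension per word
    TruncatedSVD(n_components=3),  # sparse-friendly projection down
    SVC(kernel="rbf"),             # kernelized comparison
)
model.fit(docs, labels)
print(model.predict(["the kitten naps on the sofa",
                     "markets rallied after the rate decision"]))
```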
