Kernelization – How It Improves the K Nearest Neighbour Algorithm

k nearest neighbour, kernel trick, machine learning

I'm new to kernels and have hit a snag while trying to kernelise kNN.

Preliminaries

I'm using a polynomial kernel:
$K(\mathbf{x},\mathbf{y}) = (1 + \langle \mathbf{x},\mathbf{y} \rangle)^d$

Your typical Euclidean kNN uses the following distance metric:
$d(\mathbf{x}, \mathbf{y}) = \vert\vert \mathbf{x} - \mathbf{y} \vert\vert$

Let $f(\mathbf{x})$ map $\mathbf{x}$ into some higher-dimensional feature space. Then the square of the above distance metric in Hilbert space can be expressed by inner products:
$d^2(f(\mathbf{x}), f(\mathbf{y})) = K(\mathbf{x},\mathbf{x}) - 2K(\mathbf{x}, \mathbf{y}) + K(\mathbf{y},\mathbf{y})$
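To see why, expand the squared distance in the feature space and replace each inner product with a kernel evaluation:

$d^2(f(\mathbf{x}), f(\mathbf{y})) = \langle f(\mathbf{x}) - f(\mathbf{y}),\ f(\mathbf{x}) - f(\mathbf{y}) \rangle = \langle f(\mathbf{x}), f(\mathbf{x}) \rangle - 2\langle f(\mathbf{x}), f(\mathbf{y}) \rangle + \langle f(\mathbf{y}), f(\mathbf{y}) \rangle$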

Note that if we set the kernel degree $d = 1$, the above reduces to the standard (squared) Euclidean distance.
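As a quick numerical sanity check, here is a minimal NumPy sketch of that distance (the helper names `poly_kernel` and `kernel_distance` are mine, not from any particular library):

```python
import numpy as np

def poly_kernel(x, y, d=1):
    """Polynomial kernel K(x, y) = (1 + <x, y>)^d."""
    return (1.0 + np.dot(x, y)) ** d

def kernel_distance(x, y, d=1):
    """Distance in the feature space induced by the kernel:
    d^2(f(x), f(y)) = K(x, x) - 2 K(x, y) + K(y, y)."""
    sq = poly_kernel(x, x, d) - 2.0 * poly_kernel(x, y, d) + poly_kernel(y, y, d)
    return np.sqrt(max(sq, 0.0))  # guard against tiny negative round-off

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# With degree d = 1 the kernel distance equals the ordinary Euclidean distance.
print(kernel_distance(x, y, d=1))   # 3.6055...
print(np.linalg.norm(x - y))        # 3.6055...
```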

The Question

The main problem I have is that I cannot see how kernelising kNN produces better results, as shown experimentally by, e.g., this paper (warning, direct PDF link!).

Best Answer

Cover's Theorem: Roughly stated, it says that given any finite set of points (with arbitrary labels), with high probability these points can be made linearly separable [1] by mapping them to a higher dimension [2].

Implication: Great, so this theorem tells me that if I take my dataset and map the points to a higher dimension, then I can easily find a linear classifier. However, most classifiers need to compute some kind of similarity, such as a dot product, which means the time complexity of a classification algorithm is proportional to the dimension of the data points. So a higher dimension means a larger time complexity (not to mention the space complexity of storing those high-dimensional points).

Kernel Trick: Let $n$ be the original dimension of the data points and let $f$ be the map that sends these points to a space of dimension $N \gg n$. Now, if there is a function $K$ that takes inputs $x$ and $y$ from the original space and computes $K(x, y) = \langle f(x), f(y) \rangle$, then I can compute the dot product in the higher-dimensional space in $O(n)$ time instead of $O(N)$.
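As a concrete illustration (a small NumPy sketch, assuming the degree-two kernel $(1 + \langle x, y \rangle)^2$ from the question and $n = 2$, so $N = 6$; the function names are mine): the explicit feature map and the kernel give exactly the same inner product, but the kernel never forms the $N$-dimensional vectors.

```python
import numpy as np

def explicit_map(x):
    """Explicit feature map f for the kernel (1 + <x, y>)^2 with n = 2,
    so N = 6; working with f directly costs O(N)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, np.sqrt(2) * x1 * x2, x2**2])

def kernel(x, y):
    """The same inner product via the kernel trick, computed in O(n)."""
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

print(np.dot(explicit_map(x), explicit_map(y)))  # 36.0, via the explicit map
print(kernel(x, y))                              # 36.0, without ever forming f(x)
```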

Implication: So, if the classification algorithm is only dependent on the dot product and has no dependency on the actual map $f$, I can use the kernel trick to run the algorithm in high dimensional space with almost no additional cost.

Does linear separability imply that points from the same class will get closer than points from different classes? No, there is no such guarantee. Linear separability does not imply that points from the same class have moved closer together, or that points from two different classes have moved farther apart.

So why would kNN work? It need not! However, if it does, then it is purely because of the kernel.

What does that mean? Consider the boolean feature vector $x = (x_1, x_2)$. Under a degree-two polynomial kernel, the feature vector $x$ is mapped to the vector $(x_1^2, \sqrt{2} x_1x_2, x_2^2)$ (this is the feature map of the homogeneous kernel $\langle x, y \rangle^2$; the inhomogeneous kernel $(1 + \langle x, y \rangle)^2$ adds constant and linear coordinates as well). From a vector of boolean features, just by using a degree-two polynomial, we have obtained a feature vector of "conjunctions". Thus, kernels themselves can produce some brilliant feature maps. If your data has good original features, and if it can benefit from the feature maps created by these kernels, then kNN stands to gain from using them. By benefit, I mean that the features produced by these feature maps bring points from the same class closer to each other and push points from different classes apart. Otherwise, the results won't be any different from what you get by running kNN on the original data.
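Here is that map on the four boolean inputs (a tiny NumPy sketch; for boolean $x_i$, $x_i^2 = x_i$ and $x_1 x_2$ is the conjunction):

```python
import numpy as np

def conjunction_features(x):
    """Feature map of the homogeneous degree-2 kernel <x, y>^2 for n = 2.
    For boolean inputs, x_i^2 == x_i and x1*x2 is the conjunction x1 AND x2."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", conjunction_features(np.array(x, dtype=float)))
# Only [1, 1] activates the middle (conjunction) feature.
```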

Then why use kernel kNN? We showed that the computational cost of using kernels is only slightly higher than that of the usual kNN, so if the data benefits from using kernels, why not use them anyway?
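For completeness, here is one way a kernelised kNN classifier could look, using the kernel distance from the question (a minimal sketch of my own, not the method of the linked paper; the names and the toy data are mine):

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (1 + <x, y>)^d."""
    return (1.0 + np.dot(x, y)) ** d

def kernel_knn_predict(X_train, y_train, x_query, k=3, d=2):
    """k-NN in the kernel-induced feature space.  Only kernel evaluations
    are needed; the feature map f is never computed explicitly."""
    k_qq = poly_kernel(x_query, x_query, d)
    # Squared distances d^2(f(q), f(x_i)) = K(q, q) - 2 K(q, x_i) + K(x_i, x_i)
    d2 = np.array([k_qq - 2.0 * poly_kernel(x_query, xi, d) + poly_kernel(xi, xi, d)
                   for xi in X_train])
    nearest = np.argsort(d2)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]  # majority vote among the k neighbours

# Toy usage on the XOR labelling of the unit-square corners.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 1, 1, 0])
print(kernel_knn_predict(X, y, np.array([0.9, 0.95]), k=1, d=2))  # -> 0
```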

Is there any paper that has studied which classes of data can benefit from kernels in kNN? As far as I know, no.

[1] http://en.wikipedia.org/wiki/Linear_separability
[2] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4038449&tag=1
