Kernel Functions – What Makes a Function Suitable for Machine Learning

Tags: kernel-trick, machine-learning

In the context of machine learning and pattern recognition, there's a concept called the kernel trick. When I face problems where I am asked to determine whether a function could be a kernel function or not, what exactly should be done? Should I first check whether it has the form of one of the three or four standard kernel functions, such as the polynomial, RBF, and Gaussian kernels? Then what am I supposed to do? Should I show it is positive definite? Could someone work through an example to show a step-by-step solution for such problems? For example, is $k(x,x') = e^{x^T x'}$ a kernel function (suppose we don't know that it is a Gaussian kernel)?

Best Answer

Generally, a function $k(x,y)$ is a valid kernel function (in the sense of the kernel trick) if it satisfies two key properties:

  • symmetry: $k(x,y) = k(y,x)$

  • positive semi-definiteness: for any finite set of points $x_1,\dots,x_n$ and real coefficients $c_1,\dots,c_n$, we need $\sum_{i,j} c_i c_j\, k(x_i,x_j) \ge 0$ (equivalently, every Gram matrix built from $k$ is positive semi-definite).

Reference: Page 4 of http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

Checking symmetry is usually straightforward by inspection. Verifying positive semi-definiteness analytically can be quite hairy sometimes. I can think of two strategies for checking this fact:

  • (1) Inspecting for an "inner-product" representation

Consider $k(x,y) = e^{x+y}$. Can we find some $\phi(a)$ such that $k(x,y) = \phi(x)^T \phi(y)$? A little math shows that $e^{x+y} = e^x e^y$, so let $\phi(a)=e^a$ and we're done.
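
If you want a quick numerical sanity check of that factorization, here is a minimal MATLAB/Octave sketch (the anonymous function phi below is just the feature map $e^a$ from above):

x = randn();  y = randn();           % two arbitrary scalar inputs
phi = @(a) exp(a);                   % candidate feature map
disp( exp(x + y) - phi(x)*phi(y) );  % prints (approximately) 0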

If you get lucky, your $k()$ will be amenable to this analysis. If not, you can resort to option (2):

  • (2) Checking positive semi-definiteness by random simulation.

Consider the function on $D$-dim vectors $k(\vec{x},\vec{y}) = \sum_{d=1}^D \min( x_d, y_d)$, where each vector $\vec{x}, \vec{y}$ must be non-negative and sum to one. Is this a valid kernel?

We can check this by simulation. Draw a set of $N$ random vectors $\{\vec{x}_i\}_{i=1}^N$ and build a Gram matrix $K$ where $K_{ij} = k( \vec{x}_i , \vec{x}_j )$. Then check if $K$ is positive (semi-) definite.

The best way to do this numerically is to compute the eigenvalues of the matrix (using good existing numerical libraries like scipy or MATLAB) and verify that the smallest eigenvalue is greater than or equal to 0; in practice, allow a tiny negative tolerance (e.g. $-10^{-10}$) to absorb floating-point error. If so, the matrix $K$ is p.s.d. Otherwise, you do not have a valid kernel.

Sample MATLAB/Octave code:

D = 5;     % dimension of each vector
N = 100;   % number of random vectors to draw

% Draw N random non-negative vectors that each sum to one
X = zeros(N,D);
for n = 1:N
   xcur = rand(1,D);
   X(n,:) = xcur / sum(xcur);
end

% Build the Gram matrix for k(x,y) = sum_d min(x_d, y_d)
K = zeros(N,N);
for n = 1:N
  for m = 1:N
    K(n,m) = sum( min( X(n,:), X(m,:) ) );
  end
end

% Smallest eigenvalue: >= 0 (up to numerical error) means K is p.s.d.
disp( min( eig(K) ) );

This is a very simple test, but be careful: if the test fails, you can be sure the kernel is not valid, but if it passes, the kernel might still not be valid.

I find that no matter how many random matrices I generate and regardless of $N$ and $D$, this kernel passes the test, so it is probably positive semi-definite (in fact, this is the well-known histogram intersection kernel, and has been proven valid).

However, the same test on $k(\vec{x},\vec{y}) = \sum_{d=1}^D \max( x_d, y_d)$ fails on every try I've given it (at least 20), so it is most definitely invalid, and quite easy to rule out.
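
If you want to reproduce that, only the Gram-matrix computation in the sample code above needs to change; a minimal sketch, reusing the X drawn above:

K = zeros(N,N);
for n = 1:N
  for m = 1:N
    K(n,m) = sum( max( X(n,:), X(m,:) ) );   % max in place of min
  end
end
disp( min( eig(K) ) );   % comes out clearly negative, so this k is not p.s.d.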

I really like this second option because it's quite rapid and much easier to debug than complicated formal proofs. According to Jitendra Malik's slide 19, the intersection kernel was introduced in 1991 but not proven correct until 2005. Formal proofs can be very challenging!
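
Finally, to tie this back to the kernel from the question, $k(x,x') = e^{x^T x'}$: the same simulation applies directly. A minimal sketch (drawing the inputs from a standard Gaussian is just one arbitrary choice):

D = 5;  N = 100;
X = randn(N, D);                       % arbitrary real-valued input vectors
K = zeros(N, N);
for n = 1:N
  for m = 1:N
    K(n,m) = exp( X(n,:) * X(m,:)' );  % k(x,y) = exp(x^T y)
  end
end
disp( min( eig(K) ) );   % should stay >= 0 (up to numerical error), consistent with a valid kernel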