You can obtain the explicit equation of $\phi$ for the Gaussian kernel via the Taylor series expansion of $e^x$. For notational simplicity, assume $x\in \mathbb{R}^1$:
$$\phi(x) = e^{-x^2/2\sigma^2} \Big[ 1, \sqrt{\frac{1}{1!\sigma^2}}x,\sqrt{\frac{1}{2!\sigma^4}}x^2,\sqrt{\frac{1}{3!\sigma^6}}x^3,\ldots\Big]^T$$
This is also discussed in more detail in these slides by Chih-Jen Lin of NTU (slide 11 specifically). Note that the slides use $\gamma=\frac{1}{2\sigma^2}$ as the kernel parameter.
The equation in the OP only holds for the linear kernel.
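As a sanity check, the expansion can be verified numerically: truncating $\phi$ after enough terms, the explicit inner product should match the kernel value. A minimal sketch (the truncation length 30 is an arbitrary choice):

```python
import math

def phi(x, sigma=1.0, n_terms=30):
    # Truncated explicit feature map for the 1-D Gaussian kernel:
    # phi_n(x) = exp(-x^2 / (2 sigma^2)) * x^n / sqrt(n! sigma^(2n))
    pref = math.exp(-x**2 / (2 * sigma**2))
    return [pref * x**n / math.sqrt(math.factorial(n) * sigma**(2 * n))
            for n in range(n_terms)]

def gaussian_kernel(x, y, sigma=1.0):
    return math.exp(-(x - y)**2 / (2 * sigma**2))

x, y = 0.7, -0.3
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = gaussian_kernel(x, y)
print(approx, exact)  # agree to high precision
```

The inner product of the truncated features recovers the kernel value because the neglected tail of the series vanishes factorially fast.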
But what I don't understand is why mapping data into a higher-dimensional space (say an infinite-dimensional one, using the RBF kernel) makes it more suitable for linear PCA in the feature space. Is the performance always guaranteed to improve, as compared to PCA applied directly to nonlinear data?
It might be useful if you're interested in classification with a linear method, or in clustering: in the higher-dimensional space, data might be easier to classify linearly, or clusters might be easier to separate. For a simplified, theoretical version of this idea, you could check out Cover's theorem.
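A toy illustration of that idea (hypothetical data; the extra coordinate $x^2+y^2$ is a hand-picked stand-in for what a kernel map does implicitly): two concentric classes that are not linearly separable in the plane become separable by a single threshold after the lift.

```python
import random

random.seed(0)
# Class 0: points inside a small square around the origin (radius^2 <= 0.5).
inner = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(50)]
# Class 1: points in the annulus 1 < x^2 + y^2 < 4.
outer = []
while len(outer) < 50:
    x, y = random.uniform(-2, 2), random.uniform(-2, 2)
    if 1.0 < x * x + y * y < 4.0:
        outer.append((x, y))

# Lift each point to 3-D by appending the feature x^2 + y^2.  A single
# threshold on that coordinate now separates the classes perfectly.
lifted_inner = [x * x + y * y for x, y in inner]
lifted_outer = [x * x + y * y for x, y in outer]
print(max(lifted_inner) < min(lifted_outer))  # True
```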
I tried to visualize the process, but it is hard for me when I use the RBF kernel.
That's not really surprising, considering that the embedding space is infinite-dimensional.
I don't physically or intuitively see why mapping the data into a higher-dimensional space will always work for PCA purposes (for nonlinear data).
That's because it doesn't have to work well. I remember once trying polynomial kernels with SVMs on a problem that should in theory be solvable by them (text classification; polynomial kernels act somewhat like Boolean expressions on one-hot encoded vectors), but in practice the method was worse than using a linear kernel.
Is there any concrete example or explanation of what actually happens (not just an illustrative plot) and why it helps to use the RBF kernel?
I don't know whether it is possible to formulate this for PCA, but for classification with SVMs there exist estimates of the VC dimension (which effectively measures how complex a decision function the classifier can handle). Some example facts can be found in these slides (see "VC Dimension of Support Vector Machines").
Best Answer
I think the key to the magic is smoothness. The long answer that follows simply explains this smoothness; it may or may not be the answer you expect.
Short answer:
Given a positive definite kernel $k$, there exists a corresponding space of functions $\mathcal{H}$, whose properties are determined by the kernel. It turns out that if $k$ is a Gaussian kernel, the functions in $\mathcal{H}$ are very smooth. So a learned function (e.g., a regression function, or principal components in the RKHS as in kernel PCA) is very smooth. Usually the smoothness assumption is sensible for the datasets we want to tackle. This explains why a Gaussian kernel is magical.
Long answer for why a Gaussian kernel gives smooth functions:
A positive definite kernel $k(x,y)$ implicitly defines an inner product $k(x,y)=\left\langle \phi(x),\phi(y)\right\rangle _{\mathcal{H}}$ for a feature vector $\phi(x)$ constructed from your input $x$, where $\mathcal{H}$ is a Hilbert space. The notation $\left\langle \phi(x),\phi(y)\right\rangle $ means the inner product between $\phi(x)$ and $\phi(y)$. For our purpose, you can imagine $\mathcal{H}$ to be the usual Euclidean space, but possibly with an infinite number of dimensions: imagine an ordinary vector that is infinitely long, like $\phi(x)=\left(\phi_{1}(x),\phi_{2}(x),\ldots\right)$. In kernel methods, $\mathcal{H}$ is a space of functions called a reproducing kernel Hilbert space (RKHS). This space has a special property, the "reproducing property": $f(x)=\left\langle f,\phi(x)\right\rangle $. This says that to evaluate $f(x)$, you first construct a feature vector (infinitely long, as mentioned) for $f$, then construct the feature vector $\phi(x)$ for $x$ (also infinitely long); the evaluation of $f(x)$ is given by the inner product of the two. Obviously, in practice, no one constructs an infinitely long vector. Since we only care about the inner product, we just evaluate the kernel $k$ directly. Bypassing the computation of explicit features and directly computing the inner product is known as the "kernel trick".
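For a kernel whose feature map is finite-dimensional, the trick can be spelled out completely. A minimal sketch with the homogeneous polynomial kernel $k(x,y)=(x^\top y)^2$ on $\mathbb{R}^2$, whose explicit features are $\left(x_1^2,\sqrt{2}x_1x_2,x_2^2\right)$ (the test points below are arbitrary):

```python
import math

def poly2_kernel(x, y):
    # k(x, y) = (x . y)^2 for x, y in R^2 -- evaluated without features.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    # Explicit feature map: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = poly2_kernel(x, y)                                        # kernel trick
rhs = sum(a * b for a, b in zip(poly2_features(x), poly2_features(y)))
print(lhs, rhs)  # both equal 1.0
```

Here both routes give the same number, but the kernel route never materializes $\phi$; with the Gaussian kernel the right-hand side would be an infinite sum, while the left-hand side stays a single exponential.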
What are the features?
I kept saying features $\phi_{1}(x),\phi_{2}(x),\ldots$ without specifying what they are. Given a kernel $k$, the features are not unique, but the inner product $\left\langle \phi(x),\phi(y)\right\rangle $ is uniquely determined. To explain the smoothness of the functions, let us consider Fourier features. Assume a translation-invariant kernel $k$, meaning $k(x,y)=k(x-y)$, i.e., the kernel depends only on the difference of its two arguments. The Gaussian kernel has this property. Let $\hat{k}$ denote the Fourier transform of $k$.
In this Fourier viewpoint, the features of $f$ are given by $f:=\left(\cdots,\hat{f}_{l}/\sqrt{\hat{k}_{l}},\cdots\right)$. This says that the feature representation of your function $f$ is given by its Fourier transform divided by the square root of the Fourier transform of the kernel $k$. The feature representation of $x$ is $\phi(x)=\left(\cdots,\sqrt{\hat{k}_{l}}\exp\left(-ilx\right),\cdots\right)$, where $i=\sqrt{-1}$. One can show that the reproducing property holds (an exercise for readers).
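For the record, the exercise works out directly (sweeping the complex-conjugation details of the inner product under the rug; the exact signs depend on the Fourier convention used):

$$\left\langle f,\phi(x)\right\rangle =\sum_{l}\frac{\hat{f}_{l}}{\sqrt{\hat{k}_{l}}}\,\sqrt{\hat{k}_{l}}\exp\left(-ilx\right)=\sum_{l}\hat{f}_{l}\exp\left(-ilx\right)=f(x),$$

i.e., the $\sqrt{\hat{k}_{l}}$ factors cancel and the inner product reduces to the Fourier series of $f$ evaluated at $x$.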
As in any Hilbert space, all elements belonging to the space must have a finite norm. Let us consider the squared norm of an $f\in\mathcal{H}$:
$ \|f\|_{\mathcal{H}}^{2}=\left\langle f,f\right\rangle _{\mathcal{H}}=\sum_{l=-\infty}^{\infty}\frac{\hat{f}_{l}^{2}}{\hat{k}_{l}}. $
So when is this norm finite, i.e., when does $f$ belong to the space? It is when $\hat{f}_{l}^{2}$ drops faster than $\hat{k}_{l}$, so that the sum converges. Now, the Fourier transform of a Gaussian kernel $k(x,y)=\exp\left(-\frac{\|x-y\|^{2}}{\sigma^{2}}\right)$ is another Gaussian, where $\hat{k}_{l}$ decreases exponentially fast with $l$. So if $f$ is to be in this space, its Fourier transform must drop even faster than that of $k$. This means the function will effectively have only a few low-frequency components with high weights. A signal with only low-frequency components does not "wiggle" much. This explains why a Gaussian kernel gives you a smooth function.
Extra: What about a Laplace kernel?
If you consider a Laplace kernel $k(x,y)=\exp\left(-\frac{\|x-y\|}{\sigma}\right)$, its Fourier transform is a Cauchy distribution, which drops much more slowly than the exponential function in the Fourier transform of a Gaussian kernel. This means a function $f$ will have more high-frequency components. As a result, a function given by a Laplace kernel is "rougher" than one given by a Gaussian kernel.
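The contrast can be seen concretely from the standard one-dimensional closed forms of the two transforms (constants dropped; $\sigma=1$ and the sample frequencies below are arbitrary choices):

```python
import math

sigma = 1.0
# Closed-form Fourier transforms, up to constant factors:
#   Gaussian k(t) = exp(-t^2 / sigma^2)  ->  k_hat(w) ~ exp(-sigma^2 w^2 / 4)
#   Laplace  k(t) = exp(-|t| / sigma)    ->  k_hat(w) ~ 1 / (1 + sigma^2 w^2)
def gauss_hat(w):
    return math.exp(-sigma**2 * w**2 / 4)

def laplace_hat(w):
    return 1.0 / (1.0 + sigma**2 * w**2)

for w in [1, 2, 4, 8]:
    print(w, gauss_hat(w), laplace_hat(w))
# The Gaussian spectrum dies off exponentially in w; the Cauchy-shaped
# Laplace spectrum only polynomially, so the Laplace RKHS admits functions
# with far more high-frequency content.
```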
Regardless of its width, one property of the Gaussian kernel is that it is "universal". Intuitively, this means that given an arbitrary bounded continuous function $g$, there exists a function $f\in\mathcal{H}$ that is close to $g$ (in the sense of $\|\cdot\|_{\infty}$) up to any precision needed. Basically, the Gaussian kernel gives functions that can approximate "nice" (bounded, continuous) functions arbitrarily well. Gaussian and Laplace kernels are universal; a polynomial kernel, for example, is not.
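A small numerical illustration of this approximation power (not a proof): fit an expansion $f(x)=\sum_i \alpha_i k(x,x_i)$ to the continuous but non-smooth target $g(x)=|x|$ by interpolating at a grid of nodes. The bandwidth, grid, and hand-rolled solver here are arbitrary choices for the sketch:

```python
import math

def k(x, y, sigma=0.3):
    # Gaussian kernel, matching the convention used above.
    return math.exp(-(x - y)**2 / sigma**2)

def solve(A, b):
    # Plain Gaussian elimination with partial pivoting (fine for tiny systems).
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= fac * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

g = lambda x: abs(x)                        # continuous but not smooth at 0
nodes = [i / 5.0 - 1.0 for i in range(11)]  # 11 equispaced nodes in [-1, 1]
K = [[k(a, b) for b in nodes] for a in nodes]
alpha = solve(K, [g(x) for x in nodes])     # interpolation coefficients

f = lambda x: sum(a * k(x, xi) for a, xi in zip(alpha, nodes))
# Uniform error of the kernel expansion over a fine grid of [-1, 1].
err = max(abs(f(t / 50.0 - 1.0) - g(t / 50.0 - 1.0)) for t in range(101))
print(err)  # small relative to the scale of g
```

Adding more nodes drives the uniform error down further, which is the practical face of universality.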
In general, you can use anything you like as long as the resulting $k$ is positive definite. Positive definiteness is defined as $\sum_{i=1}^{N}\sum_{j=1}^{N}k(x_{i},x_{j})\alpha_{i}\alpha_{j}\geq 0$ for all $\alpha_{i}\in\mathbb{R}$, all $\{x_{i}\}_{i=1}^{N}$, and all $N\in\mathbb{N}$ (the set of natural numbers). If $k$ is not positive definite, then it does not correspond to an inner product space, and all of the analysis breaks down because you do not even have the space of functions $\mathcal{H}$ mentioned above. Nonetheless, it may still work empirically. For example, the hyperbolic tangent kernel (see number 7 on this page)
$k(x,y) = \tanh(\alpha x^\top y + c)$
which is intended to imitate the sigmoid activation units of neural networks, is only positive definite for some settings of $\alpha$ and $c$. Still, it has been reported to work in practice.
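The definition can be probed numerically: build the kernel matrix on a handful of points and test it for positive semidefiniteness, e.g. via a Cholesky factorization. A sketch (the points and the choice $\alpha=-1$, $c=0$ for the tanh kernel are arbitrary):

```python
import math

def is_psd(K, tol=1e-9):
    # Attempt a Cholesky factorization of K + tol*I; success implies the
    # matrix is (numerically) positive semidefinite.
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                d = s + tol
                if d <= 0:
                    return False
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = s / L[j][j]
    return True

pts = [-2.0, -1.0, 0.0, 0.5, 1.5, 3.0]
gauss = [[math.exp(-(a - b)**2) for b in pts] for a in pts]
tanh_bad = [[math.tanh(-a * b) for b in pts] for a in pts]  # alpha=-1, c=0

print(is_psd(gauss))     # True: Gaussian kernel matrices are always PSD
print(is_psd(tanh_bad))  # False: its diagonal is already negative
```

Failing this check on even one point set proves the kernel is not positive definite; passing it, of course, proves nothing in general.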
What about other kinds of features?
I said the features are not unique. For the Gaussian kernel, another set of features is given by the Mercer expansion; see Section 4.3.1 of the famous Gaussian process book by Rasmussen and Williams. In this case, the features $\phi(x)$ are Hermite polynomials evaluated at $x$.