Solved – Why isn’t a Gaussian kernel subject to the curse of dimensionality

high-dimensional, kernel trick, machine learning, normal distribution

This has been bugging me for a while now. I understand from this answer why Gaussian kernels are effective. But I can't wrap my head around why the infinite-dimensional feature map $\phi(x)$ of the Gaussian kernel is not subject to the problems that arise in that infinite-dimensional space. I know that the feature-mapped vector is never computed directly; instead, inner products are evaluated and used as a measure of similarity between vectors.

But let's assume we are working with data sampled from a hypersphere. In higher dimensions the data becomes subject to problems like the concentration-of-measure phenomenon, where nearly all of the volume concentrates in a thin shell near the surface and sampled points become nearly orthogonal on average. How would a Gaussian kernel help in this setting? It takes the input data and maps it to an infinite-dimensional space where, in theory, all points should again end up at roughly the same distance from the center of the sphere and orthogonal on average, thus providing no insight for classification or clustering.

Or does the Gaussian kernel map these points to a different geometric structure in the infinite-dimensional Hilbert space? If so, why is that structure not subject to the other problems that arise from the curse of dimensionality?

Best Answer

If we take samples from, say, $\mathcal N(0, I)$ in increasing dimensions, then the sample points end up (approximately) uniformly distributed on the surface of a sphere. Since most directions in high-dimensional Euclidean space are nearly orthogonal, most pairs of samples end up with inner product close to zero.
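
To make that concrete, here is a minimal NumPy sketch (the dimension, sample size, and seed are arbitrary choices, not from the answer) showing that the norms concentrate and that random pairs are nearly orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500                      # dimension and sample size (arbitrary)
X = rng.standard_normal((n, d))       # n samples from N(0, I) in R^d

# Norms concentrate tightly around sqrt(d): the relative spread is tiny.
norms = np.linalg.norm(X, axis=1)
print(norms.mean(), norms.std() / norms.mean())

# Cosine similarities of distinct pairs concentrate around 0: near-orthogonality.
U = X / norms[:, None]
cos = U @ U.T
print(np.abs(cos[~np.eye(n, dtype=bool)]).mean())
```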

The Gaussian feature map sends each point $x$ to $\phi(x)$ in the RKHS. It's interesting to note that $\lVert \phi(x) \rVert^2 = \langle \phi(x), \phi(x) \rangle = k(x, x) = 1$ for any point $x$; that is, every point is mapped onto the surface of the unit sphere in the RKHS.
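
To spell that out, writing the Gaussian kernel with bandwidth $\sigma$ as $k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{2\sigma^2}\right)$ (the same $\sigma$ that appears below), the diagonal is identically one:

$$\lVert \phi(x) \rVert^2 = k(x, x) = \exp\left(-\frac{\lVert x - x \rVert^2}{2\sigma^2}\right) = \exp(0) = 1.$$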

But are they uniform on that sphere? No: $$\langle \phi(x), \phi(y) \rangle = k(x, y) > 0$$ for any points $x$, $y$, but if the points were uniform, we'd expect inner products to be positive half the time and negative half the time.
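
A quick way to check both facts numerically, the unit diagonal and the strictly positive off-diagonal entries, is to build a Gram matrix directly. This is just a sketch; the `gaussian_kernel` helper, data shape, and $\sigma$ are illustrative choices:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Gram matrix of k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))    # 200 points in 50 dimensions (arbitrary)
K = gaussian_kernel(X, sigma=5.0)

print(np.allclose(np.diag(K), 1.0))   # True: every phi(x) lies on the unit sphere of the RKHS
print(K.min() > 0)                    # True: every inner product is strictly positive
```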

We can choose a kernel so that points are usually approximately orthogonal to one another; just take $\sigma$ really small so that $k(x, y)$ is usually almost zero. They still won't be uniform (no negative inner products) but they'll be pretty "spread out." We can also get them to almost collapse in on themselves: if we take $\sigma \to \infty$, then $$\frac{\langle \phi(x), \phi(y) \rangle }{ \lVert \phi(x) \rVert \lVert \phi(y) \rVert } \approx 1$$ for all $x$ and $y$.
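
Both extremes are easy to see numerically. Again just a sketch with arbitrary data and $\sigma$ values; since $k(x, x) = 1$, each $k(x, y)$ is already the cosine similarity in the RKHS:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))    # arbitrary data
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
off_diag = ~np.eye(len(X), dtype=bool)

for sigma in (0.1, 5.0, 100.0):
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    # Tiny sigma: values near 0 (almost orthogonal). Huge sigma: values near 1 (almost collapsed).
    print(f"sigma = {sigma:>5}: mean off-diagonal k(x, y) = {K[off_diag].mean():.3f}")
```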

But neither of these is a very useful kernel, so we don't usually use them. Instead we pick a $\sigma$ such that many pairs of points have inner products in a reasonable middle range. This is what avoids the usual problems associated with the curse of dimensionality.
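
One widely used default for landing in that regime (not something the answer above prescribes, just a common heuristic) is the "median heuristic": set $\sigma$ to the median pairwise distance, so a typical pair gets $k(x, y) \approx e^{-1/2} \approx 0.61$. A minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))    # arbitrary data

# Median heuristic: bandwidth = median pairwise Euclidean distance.
dists = pdist(X)
sigma = np.median(dists)

K = np.exp(-squareform(dists) ** 2 / (2 * sigma ** 2))
off_diag = K[~np.eye(len(X), dtype=bool)]
print(sigma, off_diag.mean())         # kernel values land in an informative middle range
```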
