Solved – How does an SVM ‘find’ an infinite feature space where linear separation is always possible

feature selection, kernel trick, svm

What is the intuition behind the fact that an SVM with a Gaussian Kernel has infinite dimensional feature space?

Best Answer

This answer explains the following:

  1. Why perfect separation is always possible with distinct points and a Gaussian kernel (of sufficiently small bandwidth)
  2. How this separation may be interpreted as linear, but only in an abstract feature space distinct from the space where the data lives
  3. How the mapping from data space to feature space is "found". Spoiler: it's not found by SVM, it's implicitly defined by the kernel you choose.
  4. Why the feature space is infinite-dimensional.

1. Achieving perfect separation

Perfect separation is always possible with a Gaussian kernel (provided no two points from different classes are ever exactly the same) because of the kernel's locality properties, which lead to an arbitrarily flexible decision boundary. For sufficiently small kernel bandwidth, the decision boundary will look like you just drew little circles around the points whenever they are needed to separate the positive and negative examples:

[Figure: a decision boundary that draws small circles around individual training points wherever needed to separate the classes. Credit: Andrew Ng's online machine learning course.]

So, why does this occur from a mathematical perspective?

Consider the standard setup: you have a Gaussian kernel $K(\mathbf{x},\mathbf{z}) = \exp(-||\mathbf{x}-\mathbf{z}||^2 / \sigma^2)$ and training data $(\mathbf{x}^{(1)},y^{(1)}), (\mathbf{x}^{(2)},y^{(2)}), \ldots, (\mathbf{x}^{(n)},y^{(n)})$ where the $y^{(i)}$ values are $\pm 1$. We want to learn a classifier function

$$\hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x})$$

Now how will we ever assign the weights $w_i$? Do we need infinite dimensional spaces and a quadratic programming algorithm? No, because I just want to show that I can separate the points perfectly. So I make $\sigma$ a billion times smaller than the smallest separation $||\mathbf{x}^{(i)} - \mathbf{x}^{(j)}||$ between any two training examples, and I just set $w_i = 1$. This means that all the training points are a billion sigmas apart as far as the kernel is concerned, and each point completely controls the sign of $\hat{y}$ in its neighborhood. Formally, we have

$$ \hat{y}(\mathbf{x}^{(k)}) = \sum_{i=1}^n y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} K(\mathbf{x}^{(k)},\mathbf{x}^{(k)}) + \sum_{i \neq k} y^{(i)} K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = y^{(k)} + \epsilon$$

where $\epsilon$ is some arbitrarily tiny value. We know $\epsilon$ is tiny because $\mathbf{x}^{(k)}$ is a billion sigmas away from any other point, so for all $i \neq k$ we have

$$K(\mathbf{x}^{(i)},\mathbf{x}^{(k)}) = \exp(-||\mathbf{x}^{(i)} - \mathbf{x}^{(k)}||^2 / \sigma^2) \approx 0.$$

Since $\epsilon$ is so small, $\hat{y}(\mathbf{x}^{(k)})$ definitely has the same sign as $y^{(k)}$, and the classifier achieves perfect accuracy on the training data.
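Here is a minimal numerical sketch of this construction (not a fitted SVM): the random 2-D points, the NumPy setup, and the helper name `y_hat` are my own illustration choices rather than anything from the argument above. Dividing the smallest pairwise distance by $10^9$ mirrors the "billion sigmas" choice, so every cross term underflows to zero and only each point's own kernel value survives.

```python
import numpy as np

# A sketch of the construction above (not a fitted SVM): tiny-bandwidth
# Gaussian kernel with all weights w_i = 1 on arbitrary labelled points.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 distinct training points in 2-D
y = rng.choice([-1, 1], size=20)      # arbitrary +/-1 labels

# Smallest pairwise separation between training points
dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
np.fill_diagonal(dists, np.inf)
sigma = dists.min() / 1e9             # every pair is now "a billion sigmas" apart

def y_hat(x):
    """Kernel classifier with w_i = 1:  sum_i y_i * K(x_i, x)."""
    k = np.exp(-((X - x) ** 2).sum(-1) / sigma ** 2)
    return np.sign(y @ k)

# Every training point gets back its own label
print(all(y_hat(x_k) == y_k for x_k, y_k in zip(X, y)))   # True
```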

2. Kernel SVM learning as linear separation

The fact that this can be interpreted as "perfect linear separation in an infinite dimensional feature space" comes from the kernel trick, which allows you to interpret the kernel as an inner product in a (potentially infinite-dimensional) feature space:

$$K(\mathbf{x}^{(i)},\mathbf{x}^{(j)}) = \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x}^{(j)})\rangle$$

where $\Phi(\mathbf{x})$ is the mapping from the data space into the feature space. It follows immediately that $\hat{y}(\mathbf{x})$ can be expressed as a linear function in the feature space:

$$ \hat{y}(\mathbf{x}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\Phi(\mathbf{x})\rangle = L(\Phi(\mathbf{x}))$$

where the linear function $L(\mathbf{v})$ is defined on feature space vectors $\mathbf{v}$ as

$$ L(\mathbf{v}) = \sum_i w_i y^{(i)} \langle\Phi(\mathbf{x}^{(i)}),\mathbf{v}\rangle$$

This function is linear in $\mathbf{v}$ because it's just a linear combination of inner products with fixed vectors. In the feature space, the decision boundary $\hat{y}(\mathbf{x}) = 0$ is just $L(\mathbf{v}) = 0$, the level set of a linear function. This is the very definition of a hyperplane in the feature space.
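For the Gaussian kernel, $\Phi$ is infinite-dimensional (see section 4 below), so the identity $K(\mathbf{x},\mathbf{z}) = \langle\Phi(\mathbf{x}),\Phi(\mathbf{z})\rangle$ cannot be checked by writing $\Phi$ out. As a finite-dimensional stand-in, here is a small sketch (function names are my own) verifying the same identity for the degree-2 polynomial kernel $K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2$ on $\mathbb{R}^2$, whose feature map is known in closed form.

```python
import numpy as np

# The Gaussian kernel's feature map cannot be written out explicitly, so this
# sketch checks K(x, z) = <Phi(x), Phi(z)> for a finite-dimensional stand-in:
# the degree-2 polynomial kernel K(x, z) = (x . z)^2 on R^2.
def K_poly2(x, z):
    return (x @ z) ** 2

def phi(x):
    # Explicit feature map for (x . z)^2 in two dimensions
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(K_poly2(x, z), phi(x) @ phi(z)))   # True
```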

3. Understanding the mapping and feature space

Note: In this section, the notation $\mathbf{x}^{(i)}$ refers to an arbitrary set of $n$ points and not the training data. This is pure math; the training data does not figure into this section at all!

Kernel methods never actually "find" or "compute" the feature space or the mapping $\Phi$ explicitly. Kernel learning methods such as SVM do not need them to work; they only need the kernel function $K$.

That said, it is possible to write down a formula for $\Phi$. The feature space that $\Phi$ maps to is rather abstract (and potentially infinite-dimensional), but essentially the mapping is just using the kernel to do some simple feature engineering. In terms of the final result (the model you end up learning), using kernels is no different from the traditional feature engineering commonly applied in linear regression and GLM modeling, such as taking the log of a positive predictor variable before feeding it into a regression formula. The math is mostly there to ensure the kernel plays well with the SVM algorithm, which has its vaunted advantages of sparsity and scaling well to large datasets.

If you're still interested, here's how it works. Essentially we take the identity we want to hold, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle = K(\mathbf{x},\mathbf{y})$, and construct a space and inner product such that it holds by definition. To do this, we define an abstract vector space $V$ where each vector is a function from the space the data lives in, $\mathcal{X}$, to the real numbers $\mathbb{R}$. A vector $f$ in $V$ is a function formed from a finite linear combination of kernel slices:

$$f(\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$$

It is convenient to write $f$ more compactly as

$$f = \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}$$

where $K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y})$ is a function giving a "slice" of the kernel at $\mathbf{x}$.

The inner product on the space is not the ordinary dot product, but an abstract inner product based on the kernel:

$$\langle \sum_{i=1}^n \alpha_i K_{\mathbf{x}^{(i)}}, \sum_{j=1}^n \beta_j K_{\mathbf{x}^{(j)}} \rangle = \sum_{i,j} \alpha_i \beta_j K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$$

With the feature space defined in this way, $\Phi$ is a mapping $\mathcal{X} \rightarrow V$, taking each point $\mathbf{x}$ to the "kernel slice" at that point:

$$\Phi(\mathbf{x}) = K_\mathbf{x}, \quad \text{where} \quad K_\mathbf{x}(\mathbf{y}) = K(\mathbf{x},\mathbf{y}). $$

You can prove that $V$ is an inner product space when $K$ is a positive definite kernel. See this paper for details. (Kudos to f coppens for pointing this out!)
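As a quick sanity check on this construction, here is a sketch (the anchor points, seed, and helper names are my own choices) that represents elements of $V$ by their coefficient vectors on a shared set of points and computes the abstract inner product through the kernel Gram matrix. In particular it confirms $\langle K_\mathbf{x}, K_\mathbf{y} \rangle = K(\mathbf{x},\mathbf{y})$, which holds by construction.

```python
import numpy as np

# Sketch of the inner product on V.  Functions f = sum_i alpha_i K_{x_i} and
# g = sum_j beta_j K_{x_j} built on the same anchor points are represented by
# their coefficient vectors, and <f, g> = alpha^T G beta, where G is the
# kernel Gram matrix of the anchors.
def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(2)
anchors = rng.normal(size=(5, 2))          # the points x^(1), ..., x^(n)
G = gaussian_kernel(anchors, anchors)      # G_ij = K(x^(i), x^(j))

alpha = rng.normal(size=5)                 # coefficients of f
beta = rng.normal(size=5)                  # coefficients of g
inner_fg = alpha @ G @ beta                # <f, g> as defined above

# Sanity check: <K_x, K_y> = K(x, y).  K_x has coefficient vector e_i when
# x is the i-th anchor point.
e0, e1 = np.eye(5)[0], np.eye(5)[1]
print(inner_fg, np.isclose(e0 @ G @ e1, G[0, 1]))   # ..., True
```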

4. Why is the feature space infinite-dimensional?

This answer gives a nice linear algebra explanation, but here's a geometric perspective, with both intuition and proof.

Intuition

For any fixed point $\mathbf{z}$, we have a kernel slice function $K_\mathbf{z}(\mathbf{x}) = K(\mathbf{z},\mathbf{x})$. The graph of $K_\mathbf{z}$ is just a Gaussian bump centered at $\mathbf{z}$. Now, if the feature space were only finite dimensional, that would mean we could take a finite set of bumps at a fixed set of points and form any Gaussian bump anywhere else. But clearly there's no way we can do this; you can't make a new bump out of old bumps, because the new bump could be really far away from the old ones. So, no matter how many feature vectors (bumps) we have, we can always add new bumps, and in the feature space these are new independent vectors. So the feature space can't be finite dimensional; it has to be infinite.
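To make this concrete, here is a small sketch (the 1-D grid, the bump centers, and the helper name `bump` are my own choices, chosen in one dimension for simplicity): the best least-squares combination of old bumps explains essentially none of a bump centered far away.

```python
import numpy as np

# Intuition check: try to reproduce a Gaussian bump centered far away as the
# best least-squares combination of bumps at old centers, evaluated on a
# dense grid.  The relative fit residual stays close to 1.
def bump(center, xs, sigma=1.0):
    return np.exp(-(xs - center) ** 2 / sigma ** 2)

xs = np.linspace(-5, 30, 2000)                   # evaluation grid
old_centers = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
new_center = 25.0                                # far from every old bump

A = np.stack([bump(c, xs) for c in old_centers], axis=1)   # columns: old bumps
b = bump(new_center, xs)                                   # target: new bump
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = np.linalg.norm(A @ coef - b) / np.linalg.norm(b)
print(residual)   # ~1.0: the old bumps capture essentially none of the new one
```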

Proof

We use induction. Suppose you have an arbitrary set of points $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(n)}$ such that the vectors $\Phi(\mathbf{x}^{(i)})$ are linearly independent in the feature space. Now find a point $\mathbf{x}^{(n+1)}$ distinct from these $n$ points, in fact a billion sigmas away from all of them. We claim that $\Phi(\mathbf{x}^{(n+1)})$ is linearly independent from the first $n$ feature vectors $\Phi(\mathbf{x}^{(i)})$.

Proof by contradiction. Suppose to the contrary that

$$\Phi(\mathbf{x}^{(n+1)}) = \sum_{i=1}^n \alpha_i \Phi(\mathbf{x}^{(i)})$$

Now take the inner product on both sides with an arbitrary $\mathbf{x}$. By the identity $\langle \Phi(\mathbf{z}), \Phi(\mathbf{x}) \rangle = K(\mathbf{z},\mathbf{x})$, we obtain

$$K(\mathbf{x}^{(n+1)},\mathbf{x}) = \sum_{i=1}^n \alpha_i K(\mathbf{x}^{(i)},\mathbf{x})$$

Here $\mathbf{x}$ is a free variable, so this equation is an identity stating that two functions are the same. In particular, it says that a Gaussian centered at $\mathbf{x}^{(n+1)}$ can be represented as a linear combination of Gaussians at other points $\mathbf{x}^{(i)}$. It is obvious geometrically that one cannot create a Gaussian bump centered at one point from a finite combination of Gaussian bumps centered at other points, especially when all those other Gaussian bumps are a billion sigmas away. So our assumption of linear dependence has led to a contradiction, as we set out to show.
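The same conclusion can be checked numerically. If some $\Phi(\mathbf{x}^{(i)})$ were a linear combination of the others, then the kernel Gram matrix $G_{ij} = K(\mathbf{x}^{(i)},\mathbf{x}^{(j)})$ would be singular, because the corresponding row of $G$ would be that same linear combination of the other rows. The sketch below (random points, scale, and helper names are my own choices) shows the smallest eigenvalue of the Gaussian Gram matrix staying strictly positive as points are added.

```python
import numpy as np

# Numerical counterpart of the proof: a singular Gram matrix would be required
# for the feature vectors Phi(x^(i)) to be linearly dependent.  For distinct
# points and a Gaussian kernel, the smallest eigenvalue stays strictly
# positive, however many points we add.
def gaussian_gram(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

rng = np.random.default_rng(3)
for n in (5, 20, 50):
    X = rng.normal(size=(n, 2)) * 10            # n distinct, well-separated points
    G = gaussian_gram(X)
    print(n, np.linalg.eigvalsh(G).min() > 0)   # expect True for each n
```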