Convergence theorems for Kernel SVM and Kernel Perceptron

machine-learning, optimization, reproducing-kernel-hilbert-spaces

Context

Some time ago I asked whether SVMs could work on arbitrary Hilbert spaces. My motivation was my discomfort with the kernelized version of SVM, which, in my mind, requires that we "restart" SVM in a different Hilbert space. This kernelized approach (again, in my mind) works fine if this different space is just a higher-dimensional $\mathbb{R}^n$, but breaks down if it is something else. The answer to that question pointed out the importance of representer theorems in guaranteeing that the minimum of the associated risk functional lies in a finite-dimensional subspace of the RKHS. This gave me a sense of direction, but I still feel confused about the whole story. Maybe it's because of my poor understanding of optimization theory and statistical learning theory, so I've decided to lower my ambitions a bit and try to justify the inner workings of the kernelized binary Perceptron.

I'll give my take on the algorithm and then ask some questions at the end.

The Perceptron

Let $x_1, x_2, …, x_n$ be points in $\mathbb{R}^d$ and $y_1, y_2, …, y_n$ be negative/positive labels in $\{-1,1\}$. Suppose this dataset is linearly separable. Then the algorithm works as follows (a small runnable sketch follows the pseudocode):

Let $w \leftarrow 0, b \leftarrow 0$, and $R \leftarrow \max_i \lvert\lvert x_i \rvert\rvert$
Loop until there are no misclassifications:
$\quad$Loop from $i=1$ to $n$:
$\qquad$If $y_i(\langle w, x_i \rangle + b)\leq 0$, then
$\quad\qquad$$w \leftarrow w + y_ix_i$
$\quad\qquad$$b \leftarrow b + y_iR^2$
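For concreteness, here is a minimal Python sketch of the primal algorithm above. The `max_epochs` cap is my own addition, so the loop terminates even if the data turn out not to be separable; everything else follows the pseudocode.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Primal Perceptron. X: (n, d) array, y: (n,) array of labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))

    for _ in range(max_epochs):              # cap added so this always terminates
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:   # misclassified (or on the boundary)
                w += y[i] * X[i]
                b += y[i] * R ** 2
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes: done
            break
    return w, b
```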

It's possible to give an argument showing that the algorithm converges. I'll present a simplified version that I found generalizes easily (the classic proof builds augmented vectors $\hat{w}$ and $\hat{x}_i$, but I wasn't able to adapt it to abstract vectors and abstract inner products). In what follows, assume $b = 0$, so that the dataset is linearly separable through the origin.

Correctness and convergence upper bounds

Let $(x_1, y_1), (x_2, y_2), …, (x_n, y_n)$ be a linearly separable dataset, so that there is a $w^*$ that correctly classifies every example with some margin $\gamma > 0$; that is, $y_i\langle w^*, x_i \rangle \geq \gamma$ for all $i$. We may also assume $w^*$ has unit norm, so that $\gamma$ is the geometric margin. Under these conditions, the number of mistakes $t$ made by the Perceptron before it halts is bounded above as follows:

$$t \leq \left( \frac{R}{\gamma} \right)^2$$

For the proof, suppose we let the Perceptron run on the given dataset. Its execution defines a sequence of weight vectors $w_t$, where $w_t$ is the updated weight vector after the $t$-th misclassification and $w_0 = 0$. By definition, we have

$$w_{t} = w_{t-1} + y_ix_i$$

where $i$ comes from the misclassified example. Using the recursive definition, we get

$$\langle w_t, w^* \rangle = \langle w_{t-1}, w^* \rangle + y_i\langle x_i, w^*\rangle$$

As $w^*$ correctly classifies $x_i$, we obtain

$$\langle w_{t-1}, w^* \rangle + y_i\langle x_i, w^*\rangle \geq \langle w_{t-1}, w^* \rangle + \gamma$$

By repeating this until we reach $w_0$, we end up with

$$\langle w_t, w^* \rangle \geq t\gamma$$

We now seek an inequality in the other direction; we do so by bounding $\lvert\lvert w_t \rvert\rvert$:

$$\lvert\lvert w_t \rvert\rvert^2 = \lvert\lvert w_{t-1} \rvert\rvert^2 + 2\langle w_{t-1}, y_ix_i \rangle + \lvert\lvert x_i \rvert\rvert^2$$

Since $w_{t-1}$ misclassified $x_i$, we know the middle expression is less than or equal to zero. Hence,

$$\lvert\lvert w_{t} \rvert\rvert^2 \leq \lvert\lvert w_{t-1} \rvert\rvert^2 + \lvert\lvert x_i \rvert\rvert^2 \leq \lvert\lvert w_{t-1} \rvert\rvert^2 + R^2$$

Applying this multiple times and then taking square roots, we get

$$\lvert\lvert w_{t} \rvert\rvert \leq \sqrt{t}R$$

We can now chain the inequalities together, using Cauchy–Schwarz and $\lvert\lvert w^* \rvert\rvert = 1$:

$$t\gamma \leq \langle w_t, w^* \rangle \leq \lvert\lvert w_t \rvert\rvert \lvert\lvert w^* \rvert\rvert \leq \sqrt{t}R$$

By squaring the outer inequality and cancelling a factor of $t$, we obtain

$$t\gamma^2 \leq R^2$$

Which finally leads us to the desired inequality

$$t \leq \left( \frac{R}{\gamma} \right)^2$$

Note that these steps work in any inner product space, which leads me to conclude that the binary Perceptron works if $X = \ell^2$, or even some crazy set-theoretic contraption.
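As a quick sanity check of the bound, one can generate data that is separable through the origin, run the $b = 0$ Perceptron, and compare the total number of updates with $(R/\gamma)^2$. The snippet below is my own construction: the labels are produced by a fixed unit vector $w^*$, and $\gamma$ is the margin of that particular $w^*$ (a valid, though not necessarily maximal, margin).

```python
import numpy as np

rng = np.random.default_rng(0)

# Data separable through the origin: labels come from a fixed unit vector w_star,
# and points too close to the separating hyperplane are discarded to force a margin.
w_star = np.array([1.0, -1.0]) / np.sqrt(2)
points = rng.normal(size=(200, 2))
margins = points @ w_star
keep = np.abs(margins) > 0.1
X, y = points[keep], np.sign(margins[keep])

# Run the b = 0 Perceptron, counting updates (mistakes) in t.
w, t = np.zeros(2), 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:
            w += yi * xi
            t += 1
            converged = False

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ w_star))      # geometric margin of this particular w_star
print(t, (R / gamma) ** 2)            # t should never exceed (R / gamma)^2
```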

Dual Algorithm and feature maps

If one inspects the algorithm above closely, one sees that $w$ is a linear combination of the training points. That is, $w = \sum_{i=1}^n \alpha_ix_i$, where $\alpha_i$ is $y_i$ times the number of mistakes made on example $i$ (in particular, an integer). Thus, instead of searching for $w$ directly, we can find the $\alpha_i$ instead. See the dual algorithm below:

Let $\alpha_i \leftarrow 0$ for all $i$, $b \leftarrow 0$, and $R \leftarrow \max_i \lvert\lvert x_i \rvert\rvert$
Loop until there are no misclassifications:
$\quad$Loop from $i=1$ to $n$:
$\qquad$If $y_i(\sum_{j=1}^n \alpha_j \langle x_j, x_i \rangle + b)\leq 0$, then
$\quad\qquad$$\alpha_i \leftarrow \alpha_i + y_i$
$\quad\qquad$$b \leftarrow b + y_iR^2$

Note that the only quantity the algorithm needs is $\langle x_i, x_j \rangle$. This lets us use feature maps to classify data that is not linearly separable. Indeed, we can select another Hilbert space $H$, choose $\phi: X \to H$, and the Perceptron will work just fine as long as $\phi(X)$ is linearly separable in $H$. In this setting, the decision function is $f(x) = \sum_{i=1}^n \alpha_i \langle \phi(x_i), \phi(x) \rangle + b$.
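Here is a minimal sketch of the dual algorithm in Python, written directly against a kernel $K(x, z) = \langle \phi(x), \phi(z) \rangle$ so that $\phi$ is never evaluated. The function names and the `max_epochs` cap are mine; note that in the kernelized setting $R$ must be measured in the feature space, i.e. $R^2 = \max_i K(x_i, x_i)$.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=1000):
    """Dual Perceptron: learns the coefficients alpha_i and the bias b.
    kernel(x, z) plays the role of <phi(x), phi(z)>."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    R2 = np.max(np.diag(K))                    # R^2 = max_i <phi(x_i), phi(x_i)>
    alpha = np.zeros(n)
    b = 0.0

    for _ in range(max_epochs):                # cap added so this always terminates
        mistakes = 0
        for i in range(n):
            if y[i] * (np.sum(alpha * K[:, i]) + b) <= 0:
                alpha[i] += y[i]
                b += y[i] * R2
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

def decision_function(x, X, alpha, b, kernel):
    """f(x) = sum_i alpha_i <phi(x_i), phi(x)> + b."""
    return sum(a_i * kernel(x_i, x) for a_i, x_i in zip(alpha, X)) + b
```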

Kernels

From the derivation above it becomes apparent that maps of the form $K(x,y) = \langle \phi(x), \phi(y) \rangle$ are important. If we can compute such maps efficiently, we can sidestep computing the feature map altogether. We call these maps kernels, and by the Moore–Aronszajn theorem these maps are precisely the symmetric, positive-definite functions $K: X\times X \to \mathbb{R}$. One can go on to construct kernels such as the polynomial and Gaussian (RBF) kernels and apply them without worry. There is one caveat to this whole discussion, though.
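For instance, here are two standard kernels, written so they could be passed as the `kernel` argument of the dual sketch above (parameter names and default values are my choice):

```python
import numpy as np

def polynomial_kernel(x, z, degree=3, c=1.0):
    """K(x, z) = (<x, z> + c)^degree; the corresponding feature space is
    finite (though combinatorially large) dimensional."""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2)); the associated RKHS is
    infinite dimensional."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))
```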

Solving the curse of dimensionality and local minima

It is a well-known fact, which goes by the name of the curse of dimensionality, that the higher the dimension of your data, the harder it becomes to learn: you need far more training examples to obtain a representative dataset, i.e., one that covers a good chunk of the input space. It is because of this issue that generalization theories such as Vapnik's were developed. With them, one can identify the important factors (such as the margin) that must be controlled so that a hypothesis generalizes well. These insights give rise to algorithms which, if well designed, can sidestep issues of local minima (they can exploit convexity, for example).

Final discussion and questions

I'm pretty happy with what I wrote above. However, to me, the kernelized version of the Perceptron depends crucially on the convergence theorem also holding in abstract Hilbert spaces. This doesn't seem to be a worry in other presentations I have found, where the standard proof draws heavily on $X$ being $\mathbb{R}^d$ and $\langle \cdot, \cdot \rangle$ being the dot product. Another thing that worries me is that I haven't explicitly used the representer theorem. So,

  1. If one can use any kernel with the kernelized Perceptron and SVM, how can we be sure the algorithms still work when the induced feature space is weird? I wouldn't want to put assumptions on which feature spaces are "usable"; it could be any RKHS. Wouldn't it then be necessary to give convergence theorems that work in any RKHS?

  2. In moving from the kernel Perceptron to kernel SVM, I feel the same problem arises. OK, I get that we can formulate the SVM minimization problem in terms of a functional, and I get that the representer theorem hints at a dual version of the optimization problem. What I don't get is how we can be sure that the algorithm performing the optimization still works in the induced feature space (the RKHS).

Can anyone shed some light on these issues?

Thanks!

Best Answer

In the proof that you have written, only the properties of the inner product and the norm are used, and these hold in any Hilbert space. So, as you have already pointed out, the proof carries over directly to any Hilbert space.

When the data are not linearly separable, a specific kernel function $K(\cdot, \cdot)$ is used to map the data into a Hilbert space where linear separability is more likely to hold. The proofs of convergence use only the properties of norms and inner products, and hence apply to any Hilbert space.

Note that the Euclidean space $\mathbb{R}^d$ is a finite-dimensional Hilbert space. The principal difference between $\mathbb{R}^d$ and an arbitrary Hilbert space, in terms of the validity of proofs, is compactness: in $\mathbb{R}^d$ every closed and bounded set is compact, but this is not true in an infinite-dimensional Hilbert space. As long as a proof or derivation in $\mathbb{R}^d$ does not rely on compactness, it carries over directly to any Hilbert space.

Building on the validity of kernel SVM, it has already been extended to the case where the observations themselves are elements of an infinite-dimensional Hilbert space, e.g., random functions. For example, see the papers here.
