Optimization – Solving Problems Using Reproducing Kernel Hilbert Spaces

calculus-of-variations, convex-optimization, hilbert-spaces

I am encountering a problem concerning Reproducing Kernel Hilbert Spaces (RKHS) in the context of machine learning using Support Vector Machines (SVMs).

With reference to this paper [Olivier Chapelle, 2006], Section 3: I will try to be brief and focused on my problem, so I may omit a rigorous description of some of the objects I use below.

Consider the following optimization problem:
$$
\displaystyle \min_{\mathbf{w},b}\:
\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}L(y_i, \mathbf{w}\cdot\mathbf{x}_i+b),
$$
where $L(y,t)=\max(0,1-yt)$ is the so-called "hinge loss". To introduce kernels, and thereby handle non-linear SVMs, the author reformulates the above problem as a search for a function in an RKHS, $\mathcal{H}$, that minimizes the following functional:
$$
F[f]=\lambda\lVert f \rVert^2_\mathcal{H} + \sum_{i=1}^{n}L(y_i, f(\mathbf{x}_i)+b).
$$
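
(For concreteness, here is a small numerical sketch of the linear objective above, with made-up data and an arbitrary choice of $(\mathbf{w}, b)$; the values are purely illustrative.)

```python
import numpy as np

# Illustrative only: evaluate ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
# for made-up data and an arbitrary (w, b).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))       # n = 20 points in R^3
y = np.sign(rng.normal(size=20))   # labels in {-1, +1}
w = rng.normal(size=3)
b = 0.1
C = 1.0

def hinge(y, t):
    # L(y, t) = max(0, 1 - y t)
    return np.maximum(0.0, 1.0 - y * t)

objective = w @ w + C * hinge(y, X @ w + b).sum()
print(objective)
```
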
I follow this part of his work; my question is the following:
What if I had some other loss function (not the hinge loss above) that does not depend solely on the inner product $\mathbf{w}\cdot\mathbf{x}_i$ (which, if I understand correctly, is what gets "replaced" by $f(\mathbf{x}_i)$), but instead involves an expression of the form:
$$
\mathbf{w}\cdot\mathbf{x}_i+b+\sqrt{\mathbf{w}^TA\mathbf{w}},
$$
where $A$ is a positive-definite symmetric matrix? That is, is there any way of expressing the term $\sqrt{\mathbf{w}^TA\mathbf{w}}$ in terms of the function $f$, so that I can pose my optimization problem in the RKHS setting?

On the other hand, the theory suggests that, whatever the loss function $L$ is, the solution of the reformulated problem above takes the form:
$$
f(\mathbf{x})=\sum_{i=1}^{n}\alpha_ik(\mathbf{x}_i, \mathbf{x}),
$$
where $k$ is the kernel associated with the adopted RKHS. Have I understood that correctly? Would the solution still have this form even if my loss function included a term like $\sqrt{\mathbf{w}^TA\mathbf{w}}$?
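
(Again just for concreteness, this is how I picture evaluating such a kernel expansion at a new point, assuming for illustration an RBF kernel $k(\mathbf{u},\mathbf{v})=\exp(-\gamma\lVert\mathbf{u}-\mathbf{v}\rVert^2)$ and made-up coefficients $\alpha_i$.)

```python
import numpy as np

# Illustrative only: evaluate f(x) = sum_i alpha_i k(x_i, x) at a new point,
# assuming an RBF kernel k(u, v) = exp(-gamma ||u - v||^2) and made-up coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2))      # the x_i
alpha = rng.normal(size=10)       # the alpha_i
gamma = 0.5

def f(x):
    d2 = ((X - x) ** 2).sum(axis=1)        # ||x_i - x||^2
    return alpha @ np.exp(-gamma * d2)

x_new = np.array([0.3, -0.7])
print(f(x_new))
```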

Best Answer

Beginning with the answer to your second question, suppose that $f \in H$, where $H$ is the reproducing kernel Hilbert space. Let $S$ be the subspace spanned by the kernel functions $k(x_i, \cdot)$ (finite-dimensional, hence closed). Then by the theory of Hilbert spaces, $f$ can be written as $f = f_S + f_P$, where $$f_S(x) = \sum_{i=1}^n a_i k(x_i, x),$$ and $f_P$ is orthogonal to $S$. Moreover, by the Pythagorean theorem, $$\| f \|^2 = \| f_S \|^2 + \| f_P \|^2.$$ In particular, this tells us that $\|f\| > \|f_S\|$ whenever $f_P \neq 0$.

Now consider $f(x_i)$, which can be written as $$f(x_i)=\langle f, k(x_i, \cdot) \rangle = \langle f_S, k(x_i, \cdot) \rangle + \langle f_P, k(x_i, \cdot) \rangle = \langle f_S, k(x_i, \cdot) \rangle + 0 = \langle f_S, k(x_i, \cdot) \rangle = f_S(x_i)$$

Thus for every $f$ we have $$\sum_{i=1}^n L(y_i, f(x_i) + b) = \sum_{i=1}^n L(y_i, f_S(x_i) + b)$$

Hence, $$F[f] = \lambda \| f\|^2 + \sum_{i=1}^n L(y_i, f(x_i) + b) \geq \lambda \| f_S\|^2 + \sum_{i=1}^n L(y_i, f_S(x_i) + b) = F[f_S],$$ with strict inequality whenever $f_P \neq 0$, and this holds for all $f \in H$. This means that any minimizer of $F$ must lie in the subspace $S$, i.e. it is a linear combination of the kernel functions.
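
If it helps, here is a small numerical illustration (not a proof) of this argument, assuming an RBF kernel and random points chosen only for the example: an $f$ built from points outside the training set is projected onto $S$, and one checks that the values $f(x_i)$ are unchanged while the norm does not increase.

```python
import numpy as np

# Numerical illustration of the decomposition f = f_S + f_P.
# The RBF kernel and all data below are made up for the sake of the example.
rng = np.random.default_rng(1)

def rbf(A, B, gamma=0.5):
    # Gram block k(a, b) = exp(-gamma ||a - b||^2) for rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = rng.normal(size=(5, 2))                   # training points x_1..x_5
Z = np.vstack([X, rng.normal(size=(3, 2))])   # extra points, so f need not lie in S
c = rng.normal(size=len(Z))                   # f = sum_j c_j k(z_j, .)

K_xx = rbf(X, X)
K_xz = rbf(X, Z)
K_zz = rbf(Z, Z)

# f_S = sum_i a_i k(x_i, .) is the projection of f onto S: it must satisfy
# <f_S, k(x_i, .)> = <f, k(x_i, .)> = f(x_i), i.e. K_xx a = K_xz c.
a, *_ = np.linalg.lstsq(K_xx, K_xz @ c, rcond=None)

f_at_X = K_xz @ c            # f(x_i)
fS_at_X = K_xx @ a           # f_S(x_i)
print(np.abs(f_at_X - fS_at_X).max())   # ~0: the loss terms are unchanged
print(c @ K_zz @ c, a @ K_xx @ a)       # ||f||^2 >= ||f_S||^2
```
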


As for the first question, quadratic terms resembling $w^T A w$ appear through what is known as the Gram matrix, which is built from the kernel: $$K = \left( k(x_i,x_j) \right)_{i,j=1}^n.$$ It is straightforward to prove that this matrix is symmetric and positive semi-definite, since if $a = (a_1, a_2, \dots, a_n)$ then $$a^T K a = \left\langle \sum_{i=1}^n a_i k(x_i, \cdot), \sum_{j=1}^n a_j k(x_j, \cdot)\right\rangle=\left\|\sum_{i=1}^n a_i k(x_i, \cdot)\right\|^2 \geq 0.$$
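
A quick numerical sanity check of these two properties, with an RBF kernel and random points used purely for illustration:

```python
import numpy as np

# Quick check (illustrative only) that a Gram matrix K = (k(x_i, x_j)) is
# symmetric and positive semi-definite, using an RBF kernel and random points.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))

def rbf_gram(X, gamma=0.3):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf_gram(X)
a = rng.normal(size=10)
print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # eigenvalues nonnegative (up to round-off)
print(a @ K @ a >= 0)                           # a^T K a = ||sum_i a_i k(x_i, .)||^2 >= 0
```
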

The identity above gives us a first hint at how to recast $w^T A w$ into the language of reproducing kernel Hilbert spaces.


Take, for instance, $$A = \operatorname{diag}(a_1, a_2, \dots, a_n),$$ where each $a_i > 0$. Then $$w^T A w = \sum_{i=1}^n a_i w_i^2.$$

Now imagine replacing $w$ with the vector of function values of $f$, so that each $w_i = f(x_i)$. Then $$\sum_{i=1}^n a_i w_i^2 = \sum_{i=1}^n a_i f(x_i)^2.$$

By the same reasoning as above, $$\sum_{i=1}^n a_i f(x_i)^2 = \sum_{i=1}^n a_i f_S(x_i)^2,$$ and so we may add this term to the loss function and still be guaranteed that a minimizer is a linear combination of kernel functions.

In short, you may introduce the term you want into your loss function, keeping in mind that $w$ is then interpreted as the vector of function values, $w = (f(x_1), f(x_2), \dots, f(x_n))$.
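
Putting the pieces together, here is a sketch of how the augmented objective could be evaluated under this interpretation, with $f$ parameterized as $\sum_i \alpha_i k(x_i, \cdot)$ so that $f(x_i) = (K\alpha)_i$ and $\|f\|^2 = \alpha^T K \alpha$; the RBF kernel, data, and all numerical values are made up, and the $\sqrt{w^T A w}$ term is placed inside the hinge argument as in the question.

```python
import numpy as np

# Sketch only: evaluate the augmented objective
#   lambda * alpha^T K alpha + sum_i L(y_i, f(x_i) + b + sqrt(w^T A w))
# with w = (f(x_1), ..., f(x_n)), A = diag(a_1, ..., a_n), f = sum_i alpha_i k(x_i, .).
# Kernel choice, data, and parameter values are all illustrative.
rng = np.random.default_rng(3)
n = 15
X = rng.normal(size=(n, 3))
y = np.sign(rng.normal(size=n))          # labels in {-1, +1}
alpha = 0.1 * rng.normal(size=n)         # coefficients of f in the kernel expansion
b, lam = 0.0, 1.0
a_diag = rng.uniform(0.1, 1.0, size=n)   # the positive diagonal of A

def rbf_gram(X, gamma=0.5):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf_gram(X)
f_vals = K @ alpha                                 # f(x_i) = (K alpha)_i
penalty = np.sqrt(np.sum(a_diag * f_vals ** 2))    # sqrt(w^T A w) with w_i = f(x_i)
losses = np.maximum(0.0, 1.0 - y * (f_vals + b + penalty))   # hinge loss terms
objective = lam * alpha @ K @ alpha + losses.sum()
print(objective)
```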
