Solved – Formulation for kernel k-means

clustering, k-means, kernel trick

Given $n$ points $\{\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n\}$ and a nonlinear feature map $\Phi$, the kernel k-means problem is formulated as follows:

$\underset{\boldsymbol{t}, \; \boldsymbol{\mu}_k, \; k = 1,\ldots,K}{\text{minimize}} \; \sum_{i=1}^{n} \sum_{k=1}^{K} t_{ik} \, \| \Phi(\boldsymbol{x}_i) - \boldsymbol{\mu}_k \|_2^2$

where the matrix $\boldsymbol{t}$ consists of binary indicator variables $t_{ik} \in \{0,1\}$ satisfying $\sum_{k=1}^{K} t_{ik} = 1$, so that each point $\boldsymbol{x}_i$ is associated with exactly one cluster.

The $j$th cluster center is computed as

$\boldsymbol{\mu}_j = \frac{1}{|\mathcal{S}_j|} \sum_{i \in \mathcal{S}_j} \Phi(\boldsymbol{x}_i),$

where $\mathcal{S}_j$ denotes the set of indices of the points currently assigned to cluster $j$.

In each iteration, the cluster $j$ to which a point $\boldsymbol{x}_i$ belongs is found as follows:

$\underset{j}{\arg\min} \;\; \Phi(\boldsymbol{x}_i) \cdot \Phi(\boldsymbol{x}_i) \; - \; \frac{2\sum_{k \in \mathcal{S}_j} \Phi(\boldsymbol{x}_i) \cdot \Phi(\boldsymbol{x}_k)}{|\mathcal{S}_j|} \; + \; \frac{\sum_{l,k \in \mathcal{S}_j} \Phi(\boldsymbol{x}_l) \cdot \Phi(\boldsymbol{x}_k)}{|\mathcal{S}_j|^2}$
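This expression is simply $\|\Phi(\boldsymbol{x}_i) - \boldsymbol{\mu}_j\|_2^2$ expanded, so every term is a kernel evaluation $\kappa(\boldsymbol{x}_i, \boldsymbol{x}_k) = \Phi(\boldsymbol{x}_i) \cdot \Phi(\boldsymbol{x}_k)$ and $\Phi$ never has to be formed explicitly. As an illustration, here is a minimal sketch of this assignment step in Python (the name `kernel_kmeans_assign` is mine, not from any library; it assumes a precomputed kernel matrix `K` and that no cluster is empty):

```python
import numpy as np

def kernel_kmeans_assign(K, labels, n_clusters):
    """One assignment pass of kernel k-means.

    K          : (n, n) precomputed kernel matrix, K[i, k] = Phi(x_i) . Phi(x_k)
    labels     : (n,) current cluster index of each point
    n_clusters : number of clusters
    """
    n = K.shape[0]
    dist = np.zeros((n, n_clusters))
    for j in range(n_clusters):
        members = np.flatnonzero(labels == j)   # indices in S_j
        size = len(members)                     # |S_j|, assumed nonzero
        # ||Phi(x_i) - mu_j||^2 = K_ii - 2/|S_j| sum_k K_ik + 1/|S_j|^2 sum_{l,k} K_lk
        dist[:, j] = (np.diag(K)
                      - 2.0 * K[:, members].sum(axis=1) / size
                      + K[np.ix_(members, members)].sum() / size**2)
    return dist.argmin(axis=1)
```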

I understand the reasoning behind this approach: the cluster centers are maintained in feature space.

It is also possible to think of another approach, in which the cluster $j$ associated with a point $\boldsymbol{x}_i$ is found as follows:

$\underset{j}{\arg\min} \; \| \Phi(\boldsymbol{x}_i) - \Phi(\boldsymbol{m}_j) \|_2^{2}$

The above step again requires only kernel-based dot product operations. However, the cluster centers are maintained in the *original* space and are updated as follows:

$\boldsymbol{m}_j = \frac{1}{|\mathcal{S}_j|} \sum_{i \in \mathcal{S}_j} \boldsymbol{x}_i$
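For comparison, a sketch of this alternative scheme under the same assumptions (`kappa` stands for any kernel function $\kappa(\boldsymbol{a}, \boldsymbol{b}) = \Phi(\boldsymbol{a}) \cdot \Phi(\boldsymbol{b})$; the function names are again mine):

```python
import numpy as np

def assign_alt(X, M, kappa):
    """Assignment step: distances to centroids kept in the original space,
    computed as kappa(x, x) - 2 kappa(x, m) + kappa(m, m)."""
    dist = np.array([[kappa(x, x) - 2.0 * kappa(x, m) + kappa(m, m)
                      for m in M] for x in X])
    return dist.argmin(axis=1)

def update_alt(X, labels, n_clusters):
    """Centroid update: plain per-cluster means of the original points."""
    return np.array([X[labels == j].mean(axis=0) for j in range(n_clusters)])
```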

I would like to understand more rigorously why the second approach is wrong or less preferable, and why kernel k-means is formulated purely in feature space.

One possible explanation:

In k-means, the crucial step is the estimation of the barycenter of each cluster, because the barycenter is what minimizes the within-cluster variance. For kernel k-means it has a simple form, the mean, in feature space. The mean in the original space is not meaningful here, because the feature-space mean generally has no pre-image: there is usually no point $\boldsymbol{m}$ with $\Phi(\boldsymbol{m}) = \boldsymbol{\mu}_j$.
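To make the first claim precise, setting the gradient of the within-cluster sum of squares to zero shows that the feature-space mean is its unique minimizer:

$\frac{\partial}{\partial \boldsymbol{\mu}_j} \sum_{i \in \mathcal{S}_j} \| \Phi(\boldsymbol{x}_i) - \boldsymbol{\mu}_j \|_2^2 = -2 \sum_{i \in \mathcal{S}_j} \left( \Phi(\boldsymbol{x}_i) - \boldsymbol{\mu}_j \right) = 0 \;\Longrightarrow\; \boldsymbol{\mu}_j = \frac{1}{|\mathcal{S}_j|} \sum_{i \in \mathcal{S}_j} \Phi(\boldsymbol{x}_i)$

No comparable statement holds for a center constrained to be of the form $\Phi(\boldsymbol{m})$ with $\boldsymbol{m}$ in the original space.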

Best Answer

An example with an explicit feature space reveals why that might be a bad idea.

Consider $x \in \mathbb{R}$ and the feature map $\Phi(x) = (x, x^2)$. Now suppose you have a data set with four points: $-1$ and $1$ in cluster 1, $-3$ and $3$ in cluster 2. Drawing these on the real line makes it clear that ordinary k-means in the original space cannot recover these clusters.

However,
\begin{align}
\Phi(1) &= (1,1) \\
\Phi(-1) &= (-1,1) \\
\Phi(3) &= (3,9) \\
\Phi(-3) &= (-3,9)
\end{align}

In feature space, these can be correctly clustered with a Lloyd-type algorithm; computing the assignments with the kernelized scheme you described is equivalent. However, the centroids in the original space are
\begin{align}
m_1 &= \frac{1 - 1}{2} = 0 \\
m_2 &= \frac{3 - 3}{2} = 0
\end{align}
Both clusters collapse to the same centroid, so $\Phi(m_1) = \Phi(m_2) = (0, 0)$ and the assignment rule $\arg\min_j \|\Phi(x_i) - \Phi(m_j)\|_2^2$ can no longer distinguish the two clusters: the centroids have become meaningless in the original space. The feature-space means, by contrast, are $\boldsymbol{\mu}_1 = (0, 1)$ and $\boldsymbol{\mu}_2 = (0, 9)$, which separate the clusters cleanly.
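The collapse is easy to verify numerically; a short check with NumPy on the same four points:

```python
import numpy as np

x = np.array([-1.0, 1.0, -3.0, 3.0])
labels = np.array([0, 0, 1, 1])      # cluster 1: {-1, 1}, cluster 2: {-3, 3}

Phi = np.stack([x, x**2], axis=1)    # explicit feature map Phi(x) = (x, x^2)

# Feature-space centroids are distinct and separate the clusters.
mu = np.array([Phi[labels == j].mean(axis=0) for j in range(2)])
print(mu)   # [[0. 1.]
            #  [0. 9.]]

# Original-space centroids both collapse to 0, so Phi(m_1) == Phi(m_2)
# and the assignment step can no longer tell the clusters apart.
m = np.array([x[labels == j].mean() for j in range(2)])
print(m)    # [0. 0.]
```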
