Solved – Why eigenvectors reveal the groups in Spectral Clustering

clustering, eigenvalues, spectral analysis

According to the Handbook of Cluster Analysis, spectral clustering is done with the following algorithm:

Input: similarity matrix $S$, number of clusters $K$

  1. Form the transition matrix $P$ with $P_{ij} = S_{ij} / d_i$ for $i,j = 1:n$ where $d_i= \sum_{j=1}^n S_{ij}$

  2. Compute the largest $K$ eigenvalues $\lambda_1 \ge \dots \ge \lambda_K$ and the corresponding eigenvectors $v^1, \dots, v^K$ of $P$.

  3. Embed the data in the $K$-th principal subspace, setting $x_i = [v_i^2, v_i^3, \dots, v_i^K]$ for $i = 1, \dots, n$, where $v_i^j$ is the $i$-th component of eigenvector $v^j$.

  4. Run the K-means algorithm on the points $x_{1:n}$.
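For concreteness, here is how I read these four steps in code – a minimal NumPy/scikit-learn sketch, where the function name and the use of a dense eigensolver are my own choices, not from the book:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(S, K):
    """Steps 1-4 above, for a precomputed similarity matrix S (n x n) and K clusters."""
    d = S.sum(axis=1)                       # d_i = sum_j S_ij
    P = S / d[:, None]                      # step 1: P_ij = S_ij / d_i (rows sum to one)

    # Step 2: P is not symmetric, so use a general eigensolver and sort by eigenvalue.
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(eigvals.real)[::-1]  # indices of the largest eigenvalues first
    V = eigvecs[:, order[:K]].real          # columns are v^1, ..., v^K

    # Step 3: drop v^1 (it is constant, because P is row-stochastic) and use the
    # remaining components as coordinates x_i = [v_i^2, ..., v_i^K].
    X = V[:, 1:]

    # Step 4: plain K-means on the embedded points.
    return KMeans(n_clusters=K, n_init=10).fit_predict(X)
```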

In the book, the algorithm is followed by an artificial example with graphics of the eigenvalues and of the resulting lower-dimensional representation of the points:

Eigenvalues and low-dimensional embedding for the book's artificial example

However, what I can't get is what property of the eigenvectors makes them reveal the groups, so that they can be clustered using simple K-means. I don't get the eigenvector graphics either – although the third eigenvector seems to have three levels of observations, each lower than the previous one, the second one seems to have only two – coming back to a high value for the observations at the end of the sample. How does this relate to the clusters found?

I think I am missing some linear algebra knowledge. So my question is: why does this work? Why do we take the largest $K$ eigenvalues and their respective eigenvectors?

Best Answer

This is a great, and subtle, question.

Before we turn to your algorithm, let us first look at the similarity matrix $S$. It is symmetric and, if your data form convex clusters (see below) and the points are enumerated appropriately, it is close to a block-diagonal matrix. That's because points within a cluster tend to have a high similarity, and points from different clusters a low one.

Below is an example for the popular "Iris" data set:

Similarity matrix for the Iris dataset

(there is a noticeable overlap between the second and the third cluster, therefore the two blocks are somewhat connected).
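(If you want to build such a matrix yourself: a minimal sketch, assuming an RBF/Gaussian kernel with bandwidth $\sigma = 1$ – my choice for illustration, not necessarily what was used for the picture above.)

```python
import numpy as np
from sklearn.datasets import load_iris

# RBF (Gaussian) similarity between all pairs of Iris points; sigma is picked by hand.
X = load_iris().data                                     # 150 points, ordered by species
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
sigma = 1.0
S = np.exp(-sq_dists / (2 * sigma**2))

# Because the points are ordered by species, the three ~50x50 blocks sit on the diagonal:
print(S[:3, :3].round(2))        # high similarities within the first cluster
print(S[:3, 100:103].round(2))   # near-zero similarities between clusters 1 and 3
```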

You can decompose this matrix into its eigenvectors and the associated eigenvalues. This is called "spectral decomposition", because it is conceptually similar to decomposing light or sound into elementary frequencies and their associated amplitudes.

The definition of an eigenvector is:

$$ A \cdot e = e \cdot \lambda $$

with $A$ being a matrix, $e$ an eigenvector and $\lambda$ its corresponding eigenvalue. We can collect all eigenvectors as columns in a matrix $E$, and the eigenvalues in a diagonal matrix $\Lambda$, so it follows:

$$ A \cdot E = E \cdot \Lambda $$

Now, there is a degree of freedom when choosing eigenvectors. Their direction is determined by the matrix, but their length is arbitrary: if $A \cdot e = e \cdot \lambda$ and $f = 7 \cdot e$ (or whatever scaling of $e$ you like), then $A \cdot f = f \cdot \lambda$, too. It is therefore common to scale eigenvectors so that their length is one ($\lVert e \rVert_2 = 1$). Also, for symmetric matrices the eigenvectors can be chosen to be mutually orthogonal:

$$ e^i \cdot e^j = \Bigg\{ \begin{array}{lcr} 1 & \text{ for } & i = j \\ 0 & \text{ otherwise } & \end{array} $$

or, in the matrix form:

$$ E \cdot E^T = I $$

Plugging this into the above matrix definition of eigenvectors leads to:

$$ A = E \cdot \Lambda \cdot E^T $$

which you can also write down, in an expanded form, as:

$$ A = \sum_i \lambda_i \cdot e^i \cdot (e^i)^T $$

(if it helps you, you can think here of the dyads $e^i \cdot (e^i)^T$ as the "elementary frequencies" and of the $\lambda_i$ as the "amplitudes" of the spectrum).
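These identities are easy to check numerically; here is a small sketch with a made-up symmetric matrix:

```python
import numpy as np

A = np.array([[4.0, 2.0, 0.0],   # any symmetric matrix will do as an example
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 5.0]])

lam, E = np.linalg.eigh(A)       # eigensolver for symmetric matrices

print(np.allclose(E @ E.T, np.eye(3)))          # E E^T = I (orthonormal eigenvectors)
print(np.allclose(A, E @ np.diag(lam) @ E.T))   # A = E Lambda E^T

# The same decomposition written as a sum of dyads lambda_i * e^i (e^i)^T:
A_rebuilt = sum(l * np.outer(e, e) for l, e in zip(lam, E.T))
print(np.allclose(A, A_rebuilt))
```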

Let us go back to our Iris similarity matrix and look at its spectrum. The first three eigenvectors look like this:

Top 3 eigenvectors of the Iris similarity matrix

You see that in the first eigenvector, the first 50 components, corresponding to the first cluster, are all non-zero (negative), while the remaining components are almost exactly zero. In the second eigenvector, the first 50 components are zero, and the remaining 100 non-zero. These 100 correspond to the "supercluster", containing the two overlapping clusters, 2 and 3. The third eigenvector has both positive and negative components. It splits the "supercluster" into the two clusters, based on the sign of its components. Taking each eigenvector to represent an axis in the feature space, and each component as a point, we can plot them in 3D:

Points in the feature space
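(If you want to reproduce something like these pictures: starting from the RBF similarity matrix of the earlier sketch, the top eigenvectors and the embedded points can be obtained as below; with my hand-picked $\sigma$ the plots will be qualitatively similar, not identical.)

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Same RBF similarity as in the earlier sketch (sigma = 1).
X = load_iris().data
S = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / 2.0)

lam, E = np.linalg.eigh(S)       # eigenvalues in ascending order
top3 = E[:, ::-1][:, :3]         # eigenvectors of the three largest eigenvalues, as columns

# Each row of top3 is one of the 3-D points in the plot above; K-means on them should
# separate the first cluster cleanly and split the two overlapping clusters reasonably well.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(top3)
print(labels[:50], labels[50:100], labels[100:], sep="\n")
```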

To see how this is related to the similarity matrix, we can take a look at the individual terms of the above sum. $\lambda_1 \cdot e^1 \cdot (e^1)^T$ looks like this:

The first block in the Iris similarity matrix

i.e. it almost perfectly corresponds to the first "block" in the matrix (and the first cluster in the data set). The second and the third cluster overlap, so the second term, $\lambda_2 \cdot e^2 \cdot (e^2)^T$, corresponds to a "supercluster" containing the two:

The second+third block in the Iris similarity matrix

and the third eigenvector splits it into the two subclusters (notice the negative values!):

Splitting into the second and the third cluster

You get the idea. Now, you might ask why your algorithm needs the transition matrix $P$ instead of working directly on the similarity matrix. The similarity matrix shows these nice blocks only for convex clusters. For non-convex clusters, it is preferable to define them as sets of points that are separated from the other points.

The algorithm you describe (Algorithm 7.2, p. 129 in the book?) is based on the random-walk interpretation of clustering (there is also a similar, but slightly different, graph-cut interpretation). If you interpret your points (data, observations) as the nodes of a graph, each entry $p_{ij}$ of the transition matrix $P$ gives you the probability that, starting at node $i$, the next step of the random walk takes you to node $j$. The matrix $P$ is simply a rescaled similarity matrix, such that its elements are probabilities row-wise (you can do it column-wise, too), i.e. each row sums to one. If the points form clusters, a random walk through them will spend much time inside the clusters and only occasionally jump from one cluster to another.

Taking $P$ to the power of $m$ shows you how likely you are to land at each point after taking $m$ random steps. A suitably large $m$ will again lead to a block-like matrix. If $m$ is too small, the blocks will not have formed yet, and if it is too large, $P^m$ will already be close to its steady state. But the block structure remains preserved in the eigenvectors of $P$.
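Here is a quick numerical illustration of that last point, again with my RBF similarity matrix for the Iris data, so the exact numbers are only indicative:

```python
import numpy as np
from sklearn.datasets import load_iris

# Rebuild the RBF similarity matrix (sigma = 1, as before) and normalize it row-wise.
X = load_iris().data
S = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1) / 2.0)
P = S / S.sum(axis=1, keepdims=True)     # transition matrix: each row sums to one

m = 5                                    # a handful of random-walk steps
Pm = np.linalg.matrix_power(P, m)

# Starting from point 0 (which lies in the first cluster), see how much probability mass
# is still on the first 50 points after m steps, compared to the other 100 points:
print(Pm[0, :50].sum(), Pm[0, 50:].sum())
```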
