Why does spectral clustering work

algorithms, clustering, eigenvalues-eigenvectors, linear algebra, machine learning

Relevant Background

I've recently learnt about the spectral clustering algorithm and had a hard time understanding why we do what we do. While trying to understand it, I stumbled upon this great post, which referenced this article.

The article explained that to cluster, we try to minimize a cut in the graph (while k-means gives us compact groups, spectral clustering gives us "connected" groups). It explained that since minimizing $\mathsf{Cut}\left(A_{1},\ldots,A_{k}\right)$ often results in isolating a few individual points, we use a different "cost" function that penalizes the cut if it is unbalanced. The formula I focused on was $$\mathsf{RatioCut}\left(A_{1},\ldots,A_{k}\right)=\sum\limits _{i=1}^{k}\frac{\mathsf{cut}(A_{i},\bar{A}_{i})}{\left|A_{i}\right|}$$
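To make the definition concrete, here is a small sketch (my own, in Python with NumPy; the toy weight matrix is made up, not from the article) of how $\mathsf{cut}$ and $\mathsf{RatioCut}$ are evaluated for a given partition, and of why isolating a single point gets penalized:

```python
import numpy as np

def ratio_cut(W, clusters):
    """Compute RatioCut(A_1, ..., A_k) for a symmetric weight matrix W.

    clusters -- list of index arrays, one per cluster A_i.
    cut(A_i, A_i_bar) is the total weight of edges leaving A_i.
    """
    n = W.shape[0]
    total = 0.0
    for A in clusters:
        A_bar = np.setdiff1d(np.arange(n), A)
        cut = W[np.ix_(A, A_bar)].sum()
        total += cut / len(A)
    return total

# toy graph: two triangles joined by one weak edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1  # the weak link we would like the cut to remove

print(ratio_cut(W, [np.array([0, 1, 2]), np.array([3, 4, 5])]))  # balanced cut: small cost
print(ratio_cut(W, [np.array([0]), np.array([1, 2, 3, 4, 5])]))  # isolating one point: heavily penalized
```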

The article proved that this minimization problem is equivalent to
$$\begin{array}{c}
\underset{H\in\mathbb{R}^{n\times k}:\;H^{t}H=I}{\mathsf{argmin}}\;\mathsf{Trace}\left(H^{t}LH\right)\\
\mathsf{subject\;to}\quad H_{ij}=\begin{cases}
\frac{1}{\sqrt{\left|A_{j}\right|}} & i\in A_{j}\\
0 & i\notin A_{j}
\end{cases}
\end{array}
$$

and since this problem is NP-hard, we relax the constraints: $$\underset{H\in\mathbb{R}^{n\times k}:\;H^{t}H=I}{\mathsf{argmin}}\;\mathsf{Trace}\left(H^{t}LH\right)$$
and the solution is given by the $H$ whose columns are the first $k$ eigenvectors of $L$ (those of the $k$ smallest eigenvalues).

The article stated that to return from the relaxed problem back to the clustering problem, we apply the k-means clustering algorithm to the rows of $H$, and if the $i$-th row is assigned to cluster $j$, we assign the $i$-th sample to cluster $j$.
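Putting the steps together, a minimal sketch of this procedure (unnormalized spectral clustering via the RatioCut relaxation, written here with NumPy and scikit-learn; this is my summary, not code from the article) could look like:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Unnormalized spectral clustering.

    W -- symmetric (n x n) weight matrix of the similarity graph.
    k -- desired number of clusters.
    """
    D = np.diag(W.sum(axis=1))              # degree matrix
    L = D - W                                # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigh sorts eigenvalues in ascending order
    H = eigvecs[:, :k]                       # columns = eigenvectors of the k smallest eigenvalues
    # k-means on the rows of H: row i is the embedding of sample i
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(H)
    return labels
```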


The Question

Why does using k-means on the rows of $H$ make a good heuristic, and why does it work? Is there a proof that this heuristic is a good approximation to the optimal solution? (The article I mentioned shows a "ladder graph" on which the spectral clustering algorithm gives a solution far from optimal, so I'm not sure a theorem stating that this is a good approximation can even be true.)


Side Note

I'm unclear on why, in the case $k=2$, we only look at a single eigenvector (the 2nd one), while in the general case of any $k$ we look at the eigenvectors of the $k$ smallest eigenvalues. Why does $k=2$ use $k-1$ eigenvectors instead of $k$ as in the general case, and why does it start from the 2nd eigenvector?

Best Answer

Why this heuristic works

I feel the heuristic comes from the limiting case -- when the graph has more than one connected branch. (Normally the graph is connected as one branch, but some links are weak, and those are the ones we shall cut.) The post you referred to (Why do we use the Laplacian matrix in Spectral Clustering?) says it well: when there is more than one connected branch, the eigenvectors of the lowest eigenvalue $\lambda_0=0$ can be chosen to have value $1$ on one branch and $0$ on the other branches -- one indicator vector per branch.

For example, take $$L=\begin{pmatrix} 1&-1&0&0&0\\ -1&1&0&0&0\\ 0&0&2&-1&-1\\ 0&0&-1&2&-1\\ 0&0&-1&-1&2\\ \end{pmatrix}$$ Then the eigenvectors of eigenvalue $0$ are spanned by $v_0=[1,1,0,0,0]^T$ and $v_1=[0,0,1,1,1]^T$. You can easily see that if you apply k-means clustering to the rows of $[v_0,v_1]$, or of any rotation of $[v_0,v_1]$, you will get the 2 clusters.
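You can check this numerically; here is a small sketch (my own, using NumPy and scikit-learn) with a weight matrix that produces exactly this Laplacian -- an edge $\{0,1\}$ and a triangle $\{2,3,4\}$:

```python
import numpy as np
from sklearn.cluster import KMeans

# two disconnected branches: an edge {0,1} and a triangle {2,3,4}
W = np.zeros((5, 5))
for i, j in [(0, 1), (2, 3), (2, 4), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W        # the Laplacian written above

eigvals, eigvecs = np.linalg.eigh(L)
print(np.round(eigvals, 6))           # two zero eigenvalues, one per branch
H = eigvecs[:, :2]                    # spans [1,1,0,0,0] and [0,0,1,1,1] (up to rotation)
print(KMeans(n_clusters=2, n_init=10).fit_predict(H))  # e.g. [0 0 1 1 1]
```

Note that `eigh` may return any orthonormal basis of the null space (a rotation of $[v_0,v_1]$), but the rows of $H$ still collapse onto two distinct points, so k-means recovers the two branches either way.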

Why 1 vector works

For your side note, you can see that if you use only one eigenvector, say $\alpha v_0 +\beta v_1$, you can still cluster the entries of that single vector to get the 2 branches. In practice this eigenvector is orthogonal to the lowest eigenvector $[1,1,1,\ldots,1]^T$, thus $\alpha\beta<0$, which makes the clustering easier.
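For the toy example above this is easy to verify (a sketch under the same toy setup; the coefficients $\alpha=1/2$, $\beta=-1/3$ are just one choice that makes the vector orthogonal to the all-ones vector):

```python
import numpy as np

# the branch indicators from the toy example
v0 = np.array([1., 1., 0., 0., 0.])
v1 = np.array([0., 0., 1., 1., 1.])
ones = np.ones(5)

# combination alpha*v0 + beta*v1 orthogonal to the all-ones vector
v = v0 / 2 - v1 / 3                   # alpha = 1/|A_1|, beta = -1/|A_2|, so alpha*beta < 0
print(np.dot(v, ones))                # 0.0 -- orthogonal to the constant vector
print(v > 0)                          # [ True  True False False False]: the sign alone splits the branches
```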

Approximating the ideal case

The above is the ideal case, which never happens in practice. Normally the graph has only one branch, but we expect the eigenvectors of the smallest eigenvalues to be close to these ideal ones -- having large values on one "branch" and lower values on the other branches, like the indicator function of each branch.

Thus, if you want $k$ clusters, you'd better have $k$ eigenvectors indicating the $k$ branches.
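One way to see this (again just a sketch on the toy graph above, with an arbitrary weak-link weight of $0.05$) is to reconnect the two branches by a weak edge, so that the graph has a single branch, and look at the eigenvector of the second-smallest eigenvalue:

```python
import numpy as np

# the two-branch toy example again, now joined by a weak edge between nodes 1 and 2
W = np.zeros((5, 5))
for i, j in [(0, 1), (2, 3), (2, 4), (3, 4)]:
    W[i, j] = W[j, i] = 1.0
W[1, 2] = W[2, 1] = 0.05              # weak link: the graph is now connected
L = np.diag(W.sum(axis=1)) - W

eigvals, eigvecs = np.linalg.eigh(L)
print(np.round(eigvals[:2], 3))       # only one exact zero remains; the 2nd eigenvalue is small but nonzero
print(np.round(eigvecs[:, 1], 2))     # nearly piecewise constant: one value on {0,1}, another on {2,3,4}
```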

Why we start from the 2nd eigenvector

It starts from the 2nd eigenvector because the 1st (lowest) eigenvector is the constant vector $[1,1,1,\ldots,1]^T$ (the indicator function of the whole graph). It is not useful for clustering, so we can exclude it during clustering.
