Solved – How are graphs of k-nearest neighbors built? (for clustering)

clusteringgraph theory

I’ve seen that there are several clustering algorithms (for example, CHAMELEON or even Spectral Clustering) that work by converting the data into a weighted (or sometimes unweighted) k-nearest neighbor graph based on the distances between points/observations/rows and I was wondering how these graphs are generated.

Are these graphs directed? If a point A has another point B as a near neighbor but point B doesn’t have point A as a near neighbor, is an edge still drawn? How are weights computed?

Best Answer

Any normalised (dis)similarity matrix can be converted to the adjacency matrix of an undirected graph (weighted or not). For an unweighted graph you'll want to empirically set a threshold to its adjacency matrix, i.e. a minimum similarity value for a connection to take place between two nodes. For a given partition of the graph, the modularity metric will quantify the total strength of its clusters, therefore by maximising modularity you get the optimal community structure corresponding to that graph (clustering).

To answer your questions:

  • The graph in question will be undirected as long as your similarity matrix is symmetric.
  • Weights (when present) are the normalised similarity values.

The modularity function is basically the objective function of a NP-hard combinatorial problem. There are plenty of (meta)heuristics that do the task and if I'm not mistaken the normalised cut algorithm used in spectral clustering is one of them. I don't have experience with Chameleon but the concept of maximising intracluster similarity while minimising intercluster similarity is the same in modularity optimisation.

Unfortunately there is no package (that I know of) that can automate the adjacency matrix conversion since finding the optimal threshold is a manual process. However, once you have that matrix, R and Mathematica have great packages to do the rest.

Related Question