Solved – the interpretation of eps parameter in DBSCAN clustering

clusteringdbscanhierarchical clusteringmachine learningspatial

I want to cluster lat-long data such that all clusters formed will have radius<=1000 meters

Questions

  1. What is the actual meaning of eps parameter? Please given an example.
  2. Will setting eps=1000 serve my purpose if distance measure is haversine in meters?

I understand that minpts parameter is the cluster size.

Best Answer

Epsilon is the local radius for expanding clusters. Think of it as a step size - DBSCAN never takes a step larger than this, but by doing multiple steps DBSCAN clusters can become much larger than eps.

If you want your "clusters" to have a maximum radius, that is a set cover type of problem, so you will probably want a greedy approximation. It's not a clustering problem, because you do not allow the clustering algorithm to discover structure larger than that. You want to approximate your data with a cover, ignoring structure.

But there are some clustering algorithms where you can bound the cluster radius (but they probably won't try hard enough to optimize for your problem):

  1. LEADER is kind of like DBSCAN minus the cluster expansion. Choose an unclustered point and add everything within a radius of x. Repeat until all points are "clustered". It does not optimize anything, and you do not get a whole lot of theoretical properties. But the maximum distance in a cluster is 2x. Run it twice and you would get very different results.
  2. Complete-link HAC after cutting the dendrogram at height x, that is the maximum distance of two points. The results should be much better than Leader's, and more stable. Nevertheless, complete-link HAC may not find he optimum. 3 CLINK is a faster variant of complete Link (just O(n²) rather than n³) but tends to find much worse solutions. You may want to run this several times on permutation of your data.
Related Question