Solved – Clustering without a distance matrix

clustering, dbscan, optimization, scikit-learn

I've recently completed a project where I used scikit-learn's DBSCAN implementation to find clusters in relatively short strings of text. I used a custom string similarity metric that allowed vectorized computation of the full $n \times n$ similarity matrix.

I know that it is possible to reduce the time complexity of DBSCAN to $O(n \log n)$ by using an appropriate indexing structure (with $O(\log n)$ lookup time) for finding the neighbors of a given data point (see: DBSCAN Time complexity).

How can I accomplish this when my data are binary feature vectors of length 1296? Is there an efficient way to index neighbor lookups for points with an arbitrary number of dimensions? If so, is this functionality built into scikit-learn, or do I need to roll my own solution?

Best Answer

Inverted lists work very well for sparse data. This is what e.g. Lucene uses.
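For illustration, here is a minimal sketch of the idea in Python. The function names are hypothetical, and the sketch assumes sparse binary vectors (few 1-bits per vector):

```python
# Minimal sketch of an inverted index over binary feature vectors.
# Assumes sparse data: each vector sets only a small fraction of its bits.
from collections import defaultdict

def build_inverted_index(vectors):
    """Map each feature position to the list of vector ids that set that bit."""
    index = defaultdict(list)
    for vid, vec in enumerate(vectors):
        for pos, bit in enumerate(vec):
            if bit:
                index[pos].append(vid)
    return index

def candidate_neighbors(index, query):
    """Collect every vector sharing at least one set bit with the query.
    Only these candidates need an exact similarity computation."""
    candidates = set()
    for pos, bit in enumerate(query):
        if bit:
            candidates.update(index[pos])
    return candidates
```

Instead of comparing the query against all $n$ points, you only score the returned candidates, which is a large saving when the vectors are sparse enough that each bit position is shared by few points.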

I don't know how extensible scikit-learn is. A lot of its code appears to be written in Cython, i.e. Python-like code compiled via C, which would make it harder to extend.

ELKI, the data mining tool I am contributing a lot to, has a (yet unpublished and undocumented) Lucene addon that would likely work for you. I hope to eventually have an inverted index for sparse vectors in ELKI main as well (because of the Lucene dependency, I plan on keeping this addon separate).

We also have (not yet integrated) code for a prefix-tree index that accelerates Levenshtein distance, but it needs more work to integrate, and perhaps some profiling. A sketch of the general idea follows.
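This is a common trie-based approach, not ELKI's actual code: compute one dynamic-programming row of the Levenshtein table per trie node, and prune any subtree once every entry in the row already exceeds the search radius:

```python
# Common sketch: trie + one Levenshtein DP row per node, with pruning.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on nodes that terminate a stored word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root, query, max_dist):
    """Return all stored words within Levenshtein distance max_dist of query."""
    results = []
    first_row = list(range(len(query) + 1))  # distance from empty prefix
    for ch, child in root.children.items():
        _search(child, ch, query, first_row, max_dist, results)
    return results

def _search(node, ch, query, prev_row, max_dist, results):
    row = [prev_row[0] + 1]
    for i, qc in enumerate(query, 1):
        row.append(min(row[i - 1] + 1,                    # insertion
                       prev_row[i] + 1,                   # deletion
                       prev_row[i - 1] + (qc != ch)))     # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:  # prune: this subtree can never come back in range
        for nc, child in node.children.items():
            _search(child, nc, query, row, max_dist, results)
```

The pruning step is what makes this faster than comparing the query to every stored string: shared prefixes are processed once, and hopeless branches are cut early.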

Most of the time, indexes only work for one particular distance. There is no general-purpose index that can support arbitrary distances at the same time. Some indexes (e.g. the M-tree and iDistance, both available in ELKI) do work with arbitrary distances, but only one distance at a time, and how well they work for your data and distance varies a lot. Usually, you need good numerical contrast in your similarities.

The question you need to ask yourself is: is there a way to find all objects within a radius of $\varepsilon$ (or with a similarity larger than $\varepsilon$) without comparing every object to every other object?

Note that for DBSCAN you can use fake distances. The actual distance values are not used; only the binary decision $d \leq \varepsilon$ is needed. This is formalized as GeneralizedDBSCAN. So if you can implement a "distance function" that returns 0 for "similar" and 1 for "not similar", and plug it into scikit-learn's DBSCAN, you should be fine. Depending on the architecture of scikit-learn, you may even be able to plug in a custom index disguised as a distance function. Inverted lists are a good candidate for binary data.
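A minimal sketch of the fake-distance idea with scikit-learn's DBSCAN, which accepts a callable metric. The Jaccard similarity and the 0.9 threshold are illustrative assumptions, and note that without a real index this still computes all pairwise distances:

```python
# Sketch: fold the similarity threshold into the metric itself, so DBSCAN
# only ever sees 0 ("similar") or 1 ("not similar").
import numpy as np
from sklearn.cluster import DBSCAN

def fake_distance(a, b, threshold=0.9):
    """Return 0 if the Jaccard similarity of two binary vectors meets the
    threshold, else 1. Only this binary decision matters to DBSCAN."""
    intersection = np.sum(np.logical_and(a, b))
    union = np.sum(np.logical_or(a, b))
    similarity = intersection / union if union else 1.0
    return 0.0 if similarity >= threshold else 1.0

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 1296))  # binary feature vectors, as in the question

# With eps=0.5, points at fake distance 0 are neighbors; points at 1 are not.
labels = DBSCAN(eps=0.5, min_samples=5, metric=fake_distance).fit_predict(X)
```

The choice of eps is arbitrary here, since any value strictly between 0 and 1 yields the same binary neighborhood decision.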
