Solved – How to evaluate a clustering/unsupervised learning problem with massive amounts of data when labels can be obtained for only a small fraction of points

clustering, machine learning, unsupervised learning

I'm wondering if anybody can point me to work on the evaluation of unsupervised learning where there is a very large number of points (say hundreds of millions) and manual labels can only ever be obtained for a small fraction of them.

To elaborate, most clustering evaluations assume gold-standard labels for the data points. While I can feasibly obtain these for a small number of points (say in the thousands, or perhaps tens of thousands), this is a small fraction of the total. Furthermore, it is obviously not possible to cluster only the labelled points, since the presence of all the unlabelled data would fundamentally change the shape of the solution. The labels are also not well defined (they are a function of the items contained in the cluster), so assigning manual labels to points individually is unlikely to be consistent.

One possibility that occurred to me is to label pairs of points as to whether they should be linked or not: however, two points chosen at random have a vanishingly small chance of being linked, so the true positive rate on the links would have very high variance.
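To make that sparsity concrete, here is a hypothetical sketch of pairwise evaluation on randomly sampled pairs (the data and the perfect predicted clustering are toy placeholders, not my actual setup):

```python
import random
from itertools import combinations

# Toy stand-ins for the labelled subset: ~100 reference clusters of 10 points each,
# and a predicted clustering that happens to agree with them exactly.
gold = {i: i // 10 for i in range(1000)}
pred = {i: i // 10 for i in range(1000)}

pairs = random.sample(list(combinations(gold, 2)), 5000)
true_links = [(a, b) for a, b in pairs if gold[a] == gold[b]]
pred_links = [(a, b) for a, b in pairs if pred[a] == pred[b]]

# With ~10 points per cluster, under 1% of random pairs are true links, so any
# precision/recall estimate over the links rests on a handful of positives.
print(len(true_links), "true links out of", len(pairs), "sampled pairs")
```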

I would appreciate if anyone knows of work in a similar area that they could point me to, or can describe an evaluation strategy known to work in such circumstances.

EDIT: One thing that makes my application hard to evaluate, besides the scale, is the number of clusters. Typically I'll have perhaps an order of magnitude fewer clusters than data points (on average 10 points per cluster), so random sampling of points to label will yield almost no sampled points in the same cluster. This means that getting creative, by sampling a single point and then evaluating in its neighbourhood, is likely to be much more productive, but it may also introduce more bias; a rough sketch of that idea follows.
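A hedged sketch of that neighbourhood-based strategy (the structures `cluster_of` and `members` are hypothetical placeholders for however the clustering is stored): sample seed points at random, then send each seed's entire predicted cluster for manual labelling.

```python
import random

def sample_neighbourhoods(cluster_of, members, n_seeds=100, rng=random):
    """Pick random seed points and return their predicted clusters for manual labelling.

    cluster_of: dict mapping point id -> predicted cluster id
    members:    dict mapping cluster id -> list of point ids in that cluster

    Labelling whole neighbourhoods guarantees some same-cluster pairs in the sample,
    at the cost of a seed-biased (cluster-size-weighted) view of the data.
    """
    seeds = rng.sample(list(cluster_of), n_seeds)
    return {seed: members[cluster_of[seed]] for seed in seeds}
```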

Best Answer

All the clustering evaluation measures I've seen can be computed on a sample only.

So you could measure quality by how well the clustering agrees with your reference labels on the labeled data only. Have a look at the adjusted Rand index (ARI), for example; it is straightforward to compute on a subset.
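For instance, a minimal sketch with scikit-learn (the arrays here are toy placeholders; in practice `predicted` would cover all points and `gold_labels` only the manually labeled ones):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

predicted = np.array([0, 0, 1, 2, 1, 2, 0])      # cluster id assigned to every point
gold_labels = {0: "A", 1: "A", 3: "B", 5: "B"}   # reference labels for the labeled subset only

labelled_idx = np.fromiter(gold_labels.keys(), dtype=int)
pred_subset = predicted[labelled_idx]
gold_subset = [gold_labels[i] for i in labelled_idx]

# ARI is computed on the labeled subset alone; the unlabeled points still
# shaped the clustering but never enter the score.
print(adjusted_rand_score(gold_subset, pred_subset))
```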

The question is whether this actually helps you solve your problem. If you overfit to your labels, you might as well use classification, and classification will always be better.