Cosine similarity is not a clustering technique. It's a similarity measure for sparse vectors that is used all over the place, in information retrieval and classification maybe even more than in clustering.
I do not have the impression that you really have understood clustering. It is an unsupervised knowledge discovery technique. As it is unsupervised, you cannot "direct" it towards building a "sports" and a "non-sports" cluster. It might just as well find an "Obama" cluster and a "non-Obama" cluster.
If you are interested in Sports as opposed to non-Sports, you are doing classification. And yes, you may use cosine distance in classification!
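As a sketch of what using cosine similarity in classification can look like, here is a minimal nearest-centroid classifier over toy term-count vectors (the vocabulary, centroids, and counts below are made-up illustration values, not from any real corpus):

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical term-count centroids over the vocabulary ["ball", "goal", "election"],
# e.g. the mean vectors of labeled training documents for each class.
centroids = {
    "sports":     (5.0, 4.0, 0.0),
    "non-sports": (0.0, 1.0, 6.0),
}

def classify(doc_vector):
    # Assign the class whose centroid is most similar (in angle) to the document.
    return max(centroids, key=lambda c: cosine_similarity(doc_vector, centroids[c]))

print(classify((4.0, 3.0, 0.0)))  # a document about balls and goals -> "sports"
```

Because only the angle matters, document length (total word count) cancels out, which is exactly why cosine is popular for sparse text vectors.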
- Classification is when you want to assign instances to the appropriate one of your known classes.
- Clustering is when you have no clue of what types there are, and you want an algorithm to discover what (if any!) types there might be. This may involve a lot of trial and error, as the algorithms may find clusters that are not interesting to you.
A clustering algorithm may find clusters such as "Sentences containing the word Banana" (most likely it will not give you this explanation though!), and it hasn't failed. It's a mathematically valid cluster, and how is the algorithm supposed to know that you don't like Bananas?
Are you sure that clustering big data is actually used anywhere?
As far as I can tell, it is rarely used. Everybody uses classification; hardly anybody uses clustering, because the clustering problem is much harder and requires manual analysis of the results.
K-means: the usual Lloyd algorithm is naively parallel, and thus trivial to implement on Hadoop. But at the same time, it does not make sense to use k-means on big data. The reason is simple: there is no dense-vector big data. K-means works well for, say, up to 10 dimensions. With double precision, that is 80 bytes per record. A modest computer with 1 GB of RAM can then already fit some 13 million vectors into main memory. And I have machines with 128 GB of RAM...
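The arithmetic behind the 13-million-vectors figure can be checked directly (a back-of-the-envelope sketch assuming 10 dimensions, 8-byte doubles, and no per-record overhead):

```python
# Memory footprint of dense k-means input under the stated assumptions.
dimensions = 10
bytes_per_value = 8                                # double precision float
bytes_per_record = dimensions * bytes_per_value    # 80 bytes per vector

ram = 1 * 1024**3                                  # 1 GB of RAM
records_in_ram = ram // bytes_per_record
print(records_in_ram)                              # roughly 13.4 million vectors
```

With 128 GB of RAM the same arithmetic gives well over a billion vectors on a single machine, before any distributed system is needed.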
So you will have a hard time coming up with a real data set where:
- I run out of memory on a single computer.
- k-means produces notable results. (On high-dimensional data, k-means is usually only as effective as random Voronoi partitions!)
- the result improves over a sample.
The last point is important: k-means computes means. The quality of a mean does not improve indefinitely as you add more data. You only get marginal changes (if the result is stable, i.e. k-means worked). Most likely, your distributed computation has already lost more precision along the way than you gain in the end...
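The "only marginal changes" claim can be made concrete with a standard statistics fact: the standard error of an estimated mean is sigma / sqrt(n), so a 100x larger sample makes a cluster center only 10x more precise (sigma = 1.0 below is an arbitrary illustrative value):

```python
from math import sqrt

# Standard error of a sample mean shrinks only as 1/sqrt(n): processing
# 100x more data buys just one extra decimal digit of precision.
sigma = 1.0  # assumed per-dimension standard deviation within a cluster
errors = {n: sigma / sqrt(n) for n in (10_000, 1_000_000, 100_000_000)}
for n, err in errors.items():
    print(f"n={n:>11,}  standard error of the mean = {err:.6f}")
```

Meanwhile, distributed floating-point summation with different reduction orders can easily cost comparable precision, which is the point made above.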
Now for DBSCAN: I'm not aware of a popular distributed implementation. Every now and then a new parallel DBSCAN is proposed, usually using grids, but I've never seen one used in practice or made publicly available. Again, there are problems with the availability of interesting data where it would make sense to use DBSCAN.
- For big data, how do you set the minPts and epsilon parameters? If you get this wrong, you won't have any clusters; or everything will be a single large cluster.
- If your data is low-dimensional, see above for k-means. Using techniques such as R*-trees and grids, a single computer can already cluster low-dimensional data with billions of points using DBSCAN.
- If you have complex data, where indexing no longer works, DBSCAN will scale quadratically and thus be an inappropriate choice for big data.
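To illustrate how sensitive DBSCAN is to those two parameters, here is a minimal single-machine O(n^2) sketch of the algorithm (a toy illustration, not one of the proposed distributed variants): with epsilon too small everything becomes noise, and with epsilon too large everything merges into one cluster.

```python
def dbscan(points, eps, min_pts):
    """Naive DBSCAN sketch. points: list of coordinate tuples.
    Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Brute-force epsilon-range query; this is the quadratic part
        # that index structures like R*-trees accelerate.
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise (may later become a border point)
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts: # expand only through core points
                queue.extend(more)
    return labels

points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
print(dbscan(points, eps=0.3, min_pts=2))   # two clusters: [0, 0, 0, 1, 1, 1]
print(dbscan(points, eps=10.0, min_pts=2))  # one giant cluster
print(dbscan(points, eps=0.01, min_pts=2))  # all noise
```

Unlike k-means, there is no fixed k, and the "right" eps depends entirely on the data's density, which is exactly why tuning it blindly on big data is hard.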
Many platforms/companies like to pretend they can reasonably run k-means on their cluster. But the fact is, it does not make sense this way; it's just marketing and tech demos. That is why they usually use random data to show off, or the dreaded broken KDDCup1999 data set (which I can still cluster faster on a single computer than on any Hadoop cluster!).
So what is really done in practice:
- The Hadoop cluster is your data warehouse (rebranded as fancy new big data).
- You run distributed preprocessing on your raw data, to massage it into shape.
- The preprocessed data is small enough to be clustered on a single computer, with more advanced algorithms (which may even scale quadratically, and do not have to be naively parallel).
- You sell it to your marketing department.
- Your marketing department sells it to the CSomethingO.
- Everybody is happy, because they are now big data experts.
Best Answer
Clustering is unsupervised and as such does not use any labels.
Zero-shot learning is a form of learning that does not conform to the standard supervised framework: the classes are not assumed to be known beforehand (you evaluate on unseen classes), but some relationship between classes is assumed. For example, some methods assume an encoding for each class; you can think of this as an embedding for the class, e.g. using word embeddings of the class names instead of one-hot vectors.
The main advantage of this approach is that it can leverage the structure that exists among classes. This lets these methods work where standard supervised learning fails; handling unseen or very small classes is a typical example.
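As a sketch of the idea (all vectors below are made-up toy values, not real word embeddings): represent each class by an embedding and predict the class whose embedding is most similar to the input, which works even for a class with zero training examples.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical 3-d embeddings for class names. "zebra" has no training
# examples at all, but it still has an embedding we can compare against,
# which is what lets zero-shot methods handle unseen classes.
class_embeddings = {
    "horse": (0.9, 0.1, 0.0),
    "tiger": (0.1, 0.9, 0.1),
    "zebra": (0.6, 0.6, 0.0),   # unseen class, known only via its embedding
}

def zero_shot_classify(input_embedding):
    # Predict the class with the most similar embedding, seen or unseen.
    return max(class_embeddings,
               key=lambda c: cosine(input_embedding, class_embeddings[c]))

print(zero_shot_classify((0.5, 0.5, 0.0)))  # horse-and-tiger-like input -> "zebra"
```

With one-hot class encodings this trick is impossible, because every unseen class is equally far from everything; the shared embedding space is what carries the structure between classes.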