Solved – What are the differences between document classification and clustering when working with a single topic

classification, clustering, text mining

I am doing some web page clustering and plan to use cosine similarity as my distance measure. Even though cosine similarity is a clustering technique, I have to provide training data in order to build the query vector. A clustering algorithm doesn't need training data in the sense of labeled classes, but how do you build the query vector for the cosine similarity calculation without any training data?

I am only interested in a single topic (sports). If I use 2 clusters and a new document is assigned to cluster 1 (say, sports), I'll keep that document; otherwise it will be rejected. In that case, how does this differ from single-class classification?

Best Answer

Cosine similarity is not a clustering technique. It is a similarity measure (the corresponding distance is usually taken as 1 minus the similarity) that is commonly used for sparse vectors, in information retrieval and classification maybe even more than in clustering.
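To make that concrete, here is a minimal sketch of cosine similarity on sparse bag-of-words vectors. The helper names `bag_of_words` and `cosine_similarity` are made up for illustration, and plain whitespace tokenization is an assumption; a real pipeline would likely use TF-IDF weights and a proper tokenizer. Note that nothing in it clusters or classifies anything: it only compares two vectors.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Sparse term-count vector: {term: count}. Whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product,
    # so iterate over the smaller vector.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty documents
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("the team won the football match")
doc2 = bag_of_words("the football team lost")
doc3 = bag_of_words("parliament passed a budget")

print(cosine_similarity(doc1, doc2))  # shares several terms with doc1
print(cosine_similarity(doc1, doc3))  # shares no terms with doc1
```

The same function can then be plugged into a classifier, a clustering algorithm, or a search engine; which one you are doing depends on what you do with the scores, not on the measure.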

I do not have the impression that you have really understood clustering. It is an unsupervised knowledge discovery technique. As it is unsupervised, you cannot "direct" it towards building a "sports" and a "non-sports" cluster. It might just as well find an "Obama" cluster and a "non-Obama" cluster.

If you are interested in Sports as opposed to non-Sports, you are doing classification. And yes, you may use cosine distance in classification!
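As a rough sketch of what classification with cosine similarity can look like, the following builds one "query vector" per class from labeled examples and assigns a new document to the most similar class. This is a nearest-centroid approach that I am swapping in for illustration, not something prescribed above; the function names, the tiny training set, and the "other" label are all assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-count vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_texts):
    """Build one summed term vector per label from (text, label) pairs."""
    model = {}
    for text, label in labeled_texts:
        # Summing instead of averaging is fine: cosine ignores vector length.
        model.setdefault(label, Counter()).update(vectorize(text))
    return model


def classify(model, text):
    """Return the label whose class vector is most similar to the document."""
    vector = vectorize(text)
    return max(model, key=lambda label: cosine(vector, model[label]))


# Hypothetical labeled training data: this is the supervision that
# clustering does not have.
training = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the new budget", "other"),
    ("the senate debated the tax bill", "other"),
]
model = train(training)

print(classify(model, "a goal in the football match"))
print(classify(model, "the senate passed the bill"))
```

The labels in `training` are what make this classification: you tell the algorithm what "sports" means, and it only has to decide which known class a new document is closest to.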

  • Classification is when you want to assign each instance to the appropriate class from a set of types you already know.
  • Clustering is when you have no clue of what types there are, and you want an algorithm to discover what (if any!) types there might be. This may involve a lot of trial and error, as the algorithms may find clusters that are not interesting to you.

A clustering algorithm may find clusters such as "Sentences containing the word Banana" (most likely it will not give you this explanation though!), and it hasn't failed. It's a mathematically valid cluster, and how is the algorithm supposed to know that you don't like Bananas?
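The following toy sketch illustrates this with a bare-bones k-means variant that assigns documents by cosine similarity. It is an illustration under stated assumptions, not a recommended implementation: the function names are made up, the initial centroids are simply the first k documents, and the four documents are invented. Note that no labels are passed in anywhere.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-count vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, k, iterations=10):
    """Group texts into k clusters by cosine similarity; returns one cluster id per text."""
    vectors = [vectorize(t) for t in texts]
    # Naive deterministic initialization (an assumption): first k documents.
    centroids = [Counter(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the sum of its members.
        updated = [Counter() for _ in range(k)]
        for v, label in zip(vectors, labels):
            updated[label].update(v)
        # Keep the previous centroid if a cluster ended up empty.
        centroids = [new if new else old
                     for new, old in zip(updated, centroids)]
    return labels


docs = [
    "banana bread recipe with ripe banana",
    "football match ended in a draw",
    "banana prices rise after poor harvest",
    "tennis final ended after five sets",
]
labels = cluster(docs, k=2)
print(labels)
```

On this toy input the two banana documents end up together and the remaining two form the other group, but the algorithm returns only cluster ids, not an explanation. Whether a group corresponds to "sports", "banana", or something else is for you to figure out afterwards, and different data or a different initialization may give a different split.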