Is cosine similarity a classification or a clustering technique?

classification, clustering, cosine similarity, machine learning, text mining

In document classification, is cosine similarity considered a classification or a clustering technique? But you need training data to create the centroid used with cosine similarity, right?

Best Answer

Neither.

Cosine similarity can be computed between arbitrary vectors. It is a similarity measure, which can be converted to a distance measure and then used in any distance-based classifier, such as nearest neighbor classification.

$$\cos \varphi = \frac{a\cdot b}{\|a\| \, \|b\|} $$

where $a$ and $b$ are whatever vectors you want to compare.
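To make the formula concrete, here is a minimal sketch in plain Python. The function names `cosine_similarity` and `cosine_distance` and the toy term-count vectors are my own illustration, not taken from any particular library, and `1 - cos` is only one common way of turning the similarity into a distance.

```python
import math

def cosine_similarity(a, b):
    """Return cos(phi) = (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One common conversion to a dissimilarity (an assumption here): 1 - cos(phi)."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical term counts over the vocabulary ["goal", "ball", "election"].
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same direction as doc_a, just a longer document
doc_c = [0, 0, 3]  # shares no terms with doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0: no shared terms
print(cosine_distance(doc_a, doc_c))
```

Note that document length does not matter, only direction. Also, `1 - cos` does not satisfy the triangle inequality, so it is not a metric in the strict sense; that is harmless for a plain nearest neighbor search but can matter for index structures that rely on metric properties.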

If you want to do NN classification, you would take $a$ to be your new document and $b$ to be each of your known sample documents in turn, then classify the new document based on the most similar sample(s).
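A rough sketch of that 1-nearest-neighbor procedure, assuming documents have already been turned into term-count vectors. The helper name `classify_nn`, the vocabulary, and the sample data are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """cos(phi) = (a . b) / (||a|| * ||b||); assumes non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify_nn(new_doc, samples):
    """Return the label of the labeled sample most similar to new_doc.

    samples is a list of (vector, label) pairs: the known sample documents.
    """
    best_vector, best_label = max(
        samples, key=lambda sample: cosine_similarity(new_doc, sample[0])
    )
    return best_label

# Hypothetical term counts over the vocabulary ["goal", "basket", "vote", "tax"].
samples = [
    ([3, 0, 0, 0], "sports"),
    ([0, 4, 0, 0], "sports"),
    ([0, 0, 2, 1], "politics"),
    ([0, 0, 1, 3], "politics"),
]

print(classify_nn([2, 1, 0, 0], samples))  # mostly "goal" terms
print(classify_nn([0, 0, 1, 1], samples))  # only "vote" and "tax" terms
```

Extending this to k nearest neighbors would mean sorting the samples by similarity and taking a majority vote over the top k labels.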

Alternatively, you could compute a centroid for a whole class, but that assumes the class is internally very consistent and that the centroid is a reasonable estimator for the cosine distances (I'm not sure about this!). NN classification is much easier for you, and depends less on your corpus being internally consistent.

Say you have the topic "sports". Some documents will talk about soccer, others about basketball, others about American football. The centroid will probably be quite meaningless. Keeping a number of good sample documents for NN classification will likely work much better.
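Here is a deliberately constructed toy case of that "sports" problem. The vocabulary, the counts, and the helper names (`centroid`, `classify_nn`, `classify_centroid`) are all invented for illustration; real corpora will be messier, so treat it as a sketch of the failure mode rather than evidence of how often it occurs.

```python
import math

def cosine_similarity(a, b):
    """cos(phi) = (a . b) / (||a|| * ||b||); assumes non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def classify_nn(doc, training):
    """Pick the class that contains the single most similar sample document."""
    return max(training, key=lambda label: max(
        cosine_similarity(doc, sample) for sample in training[label]))

def classify_centroid(doc, training):
    """Pick the class whose centroid is most similar to the document."""
    return max(training, key=lambda label: cosine_similarity(
        doc, centroid(training[label])))

# Hypothetical term counts over ["goal", "dunk", "touchdown", "market"].
training = {
    "sports": [
        [5, 0, 0, 0],  # soccer
        [0, 5, 0, 0],  # basketball
        [0, 0, 5, 0],  # American football
    ],
    "business": [
        [2, 0, 0, 3],
        [2, 0, 0, 4],
    ],
}

# A soccer article that also mentions the transfer market once.
new_doc = [4, 0, 0, 1]

print(classify_nn(new_doc, training))        # nearest sample is the soccer document
print(classify_centroid(new_doc, training))  # the sports centroid is spread over three sub-topics
```

In this toy data the sports centroid points between three unrelated sub-topics, so the soccer article ends up closer to the (tight) business centroid, while the nearest single sample is still the soccer document.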

This commonly happens when one class consists of multiple clusters. It is often misunderstood: classes do not necessarily equal clusters. Multiple classes may form one big cluster when they are hard to tell apart in the data. Conversely, a class may well contain multiple clusters if it is not very uniform.

Clustering can work well for finding good sample documents in your training data, but there are other, more appropriate methods. In a supervised context, supervised methods will generally perform better than unsupervised ones.