Solved – Text Clustering using TF-IDF and Cosine Similarity

Tags: cosine-similarity, hierarchical-clustering, text-mining

I am attempting to perform hierarchical clustering (using TF-IDF and cosine distance) on about 25,000 documents, each between one and three paragraphs long.

Given the method above, my question is: should I leave all terms in my matrix when performing the TF-IDF calculation? I understand that in these situations removing sparse terms often reduces the matrix substantially, but aren't the less frequent words exactly the ones TF-IDF gives more weight to?

Best Answer

Usually (in my experience) it does make sense to exclude some of the terms.

These terms are usually either very frequent function words (like "a", "the", "will") or very infrequent ones, and they typically have no discriminative power; that is, they are not helpful when deciding whether a document should belong to a cluster. As for the IDF weighting: a term that appears in only one or two documents does receive a high weight, but it cannot link documents together, so it mostly adds noise dimensions to the matrix.

I usually use a stopword list to exclude overly frequent words, and count-based filtering for very infrequent ones.

If you use sklearn, you can build this filtering into your vectorizer:

TfidfVectorizer(..., min_df=5)

This will exclude all terms that appear in fewer than 5 documents.
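
Putting it together, here is a minimal end-to-end sketch of the pipeline from the question: TF-IDF with both kinds of filtering, a cosine-distance matrix, and agglomerative (hierarchical) clustering. The toy corpus, min_df=2 (scaled down so the tiny example runs; something like the min_df=5 above fits the full 25,000-document collection), and n_clusters=2 are illustrative assumptions, not values from the answer.

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_distances

    # Toy corpus standing in for the ~25,000 real documents.
    docs = [
        "the cat sat on the mat",
        "a cat and a kitten on the mat",
        "dogs chase the cat around the yard",
        "my dog sat near the door",
        "stock prices fell sharply today",
        "the market rallied on strong earnings",
        "investors sold stocks after the report",
        "bond yields rose again this week",
        "the kitten chased a toy",
        "dog and cat food prices rose",
    ]

    # The stopword list removes very frequent function words; min_df removes
    # very infrequent ones (min_df=2 only because this corpus is tiny).
    vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
    X = vectorizer.fit_transform(docs)

    # Pairwise cosine distances. Note this dense n x n matrix is only
    # practical if memory allows (25,000^2 float64 entries is ~5 GB).
    D = cosine_distances(X)

    # Hierarchical (agglomerative) clustering on the precomputed distances;
    # older scikit-learn versions spell the `metric` parameter `affinity`.
    model = AgglomerativeClustering(
        n_clusters=2, metric="precomputed", linkage="average"
    )
    labels = model.fit_predict(D)
    print(labels)

On this toy data the two clusters separate the animal sentences from the finance ones; on a real corpus you would choose the number of clusters (or a distance threshold) from a dendrogram or domain knowledge.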

You can then refine this further. For example, some of the discarded infrequent words may be typos, so it can make sense to run spelling correction before vectorizing the documents.
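
As a sketch of that idea (not part of the original answer), one option is the pyspellchecker package; the library choice, the word-by-word strategy, and the helper name correct_typos are all illustrative assumptions:

    from spellchecker import SpellChecker  # pip install pyspellchecker

    spell = SpellChecker()

    def correct_typos(text):
        """Replace words the dictionary does not know with its best suggestion."""
        words = text.split()
        unknown = spell.unknown(words)  # returns lowercased unknown words
        fixed = []
        for w in words:
            if w.lower() in unknown:
                suggestion = spell.correction(w.lower())  # None if no candidate
                fixed.append(suggestion or w)  # keep the original on no match
            else:
                fixed.append(w)
        return " ".join(fixed)

    # Run over the corpus before TfidfVectorizer; suggestions depend on
    # the bundled dictionary, so spot-check the output on real data.
    docs = [correct_typos(d) for d in docs]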