Clustering of documents that are very different in number of words

clustering, data mining, information retrieval, latent-semantic-analysis, pca

I have a corpus of 643 documents of different sizes, and my goal is to cluster them according to their topics and label each cluster with a semantic name for its main topic.

I have tried different approaches to clustering, including:

  1. Clustering based on the cosine distance between the tf-idf vectors of word weights for each document.

  2. Clustering based on finding the tf-idf of words in each document, then using LSA (Latent Semantic Analysis) to find an approximate representation of each document in the latent space. I then formulate a number of query terms, where each query represents a topic, and rank documents according to their similarity to the different queries.

  3. Clustering based on finding the tf-idf of words in each document, then using PCA to find the principal components, choosing the first 10 components (which contribute more than 97% of the variance between documents), and clustering on those. (A sketch of these pipelines is shown after this list.)
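
For concreteness, here is a rough sketch of the three feature pipelines in Python with scikit-learn. The document loader, the vectorizer settings, and the numbers of components are placeholders, not my exact values:

    # Rough sketch of the three pipelines; docs holds the 643 raw document strings.
    from sklearn.decomposition import PCA, TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = load_documents()  # hypothetical loader returning a list of strings

    # Approach 1: tf-idf vectors; cosine distance = 1 - cosine similarity.
    vectorizer = TfidfVectorizer(stop_words="english")
    X_tfidf = vectorizer.fit_transform(docs)            # sparse (n_docs x n_terms)
    cosine_distances = 1.0 - cosine_similarity(X_tfidf)

    # Approach 2: LSA via truncated SVD of the tf-idf matrix.
    lsa = TruncatedSVD(n_components=100, random_state=0)  # placeholder; must be < vocabulary size
    X_lsa = lsa.fit_transform(X_tfidf)
    # A topic "query" is projected into the same latent space and compared to documents:
    query = lsa.transform(vectorizer.transform(["placeholder topic keywords"]))
    query_similarity = cosine_similarity(query, X_lsa).ravel()

    # Approach 3: PCA on the densified tf-idf matrix, keeping the first 10 components.
    pca = PCA(n_components=10, random_state=0)
    X_pca = pca.fit_transform(X_tfidf.toarray())
    print(pca.explained_variance_ratio_.sum())          # variance covered by the 10 components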

I use hierarchical agglomerative clustering and employ the silhouette value as an indication of the optimal cut level of the resulting dendrogram, and therefore of the number of clusters.
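
To pick the cut level, I do roughly the following (again a sketch; X is any of the dense representations from the sketch above, and the linkage method, metric, and candidate range are placeholders):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.metrics import silhouette_score

    X = X_lsa  # or X_pca; any dense document representation from the sketch above

    Z = linkage(X, method="average", metric="cosine")    # build the dendrogram

    best_k, best_score = None, -1.0
    for k in range(2, 21):                               # candidate numbers of clusters
        labels = fcluster(Z, t=k, criterion="maxclust")  # cut the dendrogram into k clusters
        if len(np.unique(labels)) < 2:                   # silhouette needs at least 2 clusters
            continue
        score = silhouette_score(X, labels, metric="cosine")
        if score > best_score:
            best_k, best_score = k, score

    print(best_k, best_score)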

However, I don't think that the clustering results really represent documents belonging to different topics. Most of the time I end up with one very big cluster (> 200 documents) and a few other clusters, each with a small number of documents.

I have noticed that the documents differ significantly in size. The quantiles for the number of words per document are as follows:

25% of documents have 6 or more words.
50% of documents have 12 or more words.
75% of documents have 21 or more words.
3 documents have more than 200 words. (The longest document has 608 words.)

Do you think that this large variation in the number of words could be the reason for the problem in clustering? If yes, could you please suggest possible methods to solve such a problem?

Best Answer

The TF part of the TF-IDF approach is supposed to use relative term frequencies, which amounts to normalizing the data by document length.

This is probably as good as you can do to counter this problem.
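
As a rough illustration, relative term frequency just divides each count by the document length, so a long document does not get larger term weights simply for repeating words (the tokenization here is a plain whitespace split, used only for the example):

    from collections import Counter

    def relative_tf(tokens):
        # term frequency relative to document length
        counts = Counter(tokens)
        total = len(tokens)
        return {term: count / total for term, count in counts.items()}

    print(relative_tf("apple banana apple cherry".split()))
    # {'apple': 0.5, 'banana': 0.25, 'cherry': 0.25}

If you rely on a library vectorizer, check what it actually does: for example, scikit-learn's TfidfVectorizer uses raw counts for the TF part but L2-normalizes each document vector by default, which removes the length effect whenever cosine similarity is used downstream.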

Topic clustering isn't trivial. There is much more to language than topics, so the clustering algorithm may well end up grouping texts by, say, the ethnicity, gender, or educational status of the author: these can all be reasons why people use different words and language.

Without supervision, there is no guarantee at all that you will end up with topic clusters.