Solved – Clustering algorithms for extremely sparse data

algorithms, clustering, k-means, sparse

I am trying to cluster an extremely sparse text corpus, and I know the number of clusters in advance (my data consists of the titles and author lists of scientific publications, for which I already know the number of categories).

Each entry in my corpus has between 5 and 20 features; the whole corpus has 80,000 samples and between 5,000 and 120,000 features (I can filter out features that occur very rarely, or those that occur extremely frequently). As you can see, the data are extremely sparse.
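
A minimal sketch of that kind of filtering, assuming scikit-learn's TfidfVectorizer; the min_df/max_df thresholds below are hypothetical and only show the mechanism:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical thresholds for illustration: drop terms that appear in fewer
    # than 2 documents or in more than 50% of all documents.
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.5)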

I am trying to identify the clusters by creating a TF-IDF matrix of the data and running k-means on it. The algorithm fails completely, i.e. it puts more than 99% of the data into a single cluster. I am using Python with scikit-learn for both steps. Here is some sample code (on toy data where it actually works), in case it helps someone:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    data = ['aa ab', 'aa ab', 'ac a_b', 'ac a_b', 'bc ba', 'bc ba', 'ba bc']
    vectorized = vectorizer.fit_transform(data)   # sparse TF-IDF matrix
    km = KMeans(n_clusters=3, init='random', n_init=1, verbose=1)
    km.fit(vectorized)
    print(km.labels_)

    [1 1 0 0 2 2 2]

My question is: Is there a better alternative to TF-IDF followed by k-means for this problem? Does it make sense to look into different similarity or distance measures on my TF-IDF data (e.g. cosine similarity), or am I bound to fail due to lack of data?
Thanks!

Best Answer

For text vectors there are well-known similarity measures, such as cosine similarity and the variations used, for example, in Lucene for text retrieval.
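
A minimal sketch of computing pairwise cosine similarities on TF-IDF vectors with scikit-learn (toy documents, not the poster's corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ['aa ab', 'ac a_b', 'bc ba']           # toy documents
    X = TfidfVectorizer().fit_transform(docs)     # sparse TF-IDF matrix
    print(cosine_similarity(X))                   # 3x3 matrix of pairwise cosine similarities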

k-means, however, may be a bad fit, because the computed means will not have realistic sparsity: they will be much denser than the documents they average.
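
To see the effect, here is a small sketch on a toy corpus (not the poster's data): each document uses only two terms, so the TF-IDF matrix is sparse, but each centroid averages several documents and picks up non-zero weight on most of its cluster's vocabulary:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['alpha beta', 'beta gamma', 'gamma delta', 'delta epsilon',
            'zeta eta', 'eta theta', 'theta iota', 'iota kappa']
    X = TfidfVectorizer().fit_transform(docs)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(X.nnz / np.prod(X.shape))                                          # density of the data
    print(np.count_nonzero(km.cluster_centers_) / km.cluster_centers_.size)  # density of the centroids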

That said, there are k-means variations designed for text, such as spherical k-means. You may also want to try CLUTO, which seems to be a popular tool for clustering text.
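
One way to get close to spherical k-means with plain scikit-learn (an approximation, not the real algorithm) is to L2-normalize the TF-IDF vectors and run ordinary k-means on them, as in this sketch:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    docs = ['aa ab', 'aa ab', 'ac a_b', 'ac a_b', 'bc ba', 'bc ba', 'ba bc']

    # TfidfVectorizer already L2-normalizes rows by default; normalize() is
    # applied explicitly here just to make the assumption visible.
    X = normalize(TfidfVectorizer().fit_transform(docs), norm='l2')

    # For unit-length vectors, squared Euclidean distance equals 2 - 2*cos(a, b),
    # so standard k-means on normalized vectors is closely related to clustering
    # by cosine similarity (the centroids, however, are not re-normalized at each
    # step as true spherical k-means would do).
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_)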

Hierarchical clustering is probably a good candidate, too, but it does not scale to large data sets: the usual implementations are $O(n^3)$, so for 80000 documents it will take quite some time.
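
If you want to try it on a subsample, here is a minimal sketch with SciPy, using average linkage and cosine distance on toy documents:

    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['aa ab', 'aa ab', 'ac a_b', 'ac a_b', 'bc ba', 'bc ba', 'ba bc']
    X = TfidfVectorizer().fit_transform(docs).toarray()   # SciPy's linkage expects a dense array

    # Average-linkage hierarchical clustering with cosine distance,
    # then cut the dendrogram into 3 flat clusters.
    Z = linkage(X, method='average', metric='cosine')
    print(fcluster(Z, t=3, criterion='maxclust'))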
