Solved – Clustering of tens of millions of high-dimensional data points

clustering, large data, text mining

I have a set of 50 million text snippets and I would like to create some clusters out of them. The dimensionality might be somewhere between 60k and 100k. The average text snippet length is 16 words. As you can imagine, the frequency matrix would be pretty sparse. I am looking for a software package / library / SDK that would allow me to find those clusters. I tried CLUTO in the past, but this seems to be a very heavy task for CLUTO. From my research online I found that BIRCH is an algorithm that can handle such problems, but unfortunately I couldn't find any BIRCH implementation online (I only found a couple of ad-hoc implementations, like assignment projects, that lacked any sort of documentation).
Any suggestions?

EDIT: Here is some explanation of what "dimensionality" and "similarity" mean in the context of my problem:
Each text snippet is a sentence taken from a huge text corpus (news articles). A vector is created for each sentence by projecting the text snippet onto an N-dimensional space, where each dimension corresponds to a word.
So, if a sentence is:

"I like red apples more than green apples" 

then the corresponding vector would have dimensions "I":1, "like":1, "apples":2, "more":1, etc.
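For concreteness, here is a minimal sketch of that counting step (plain whitespace tokenization is my assumption; the question does not specify how words are split):

    from collections import Counter

    sentence = "I like red apples more than green apples"

    # Count word occurrences; this mapping plays the role of the sparse vector above.
    vector = Counter(sentence.split())
    print(vector)  # Counter({'apples': 2, 'I': 1, 'like': 1, 'red': 1, ...})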

I will decide which of the millions of distinct words make up the space onto which the projection is made. Typically the top-N most frequent words are selected for this.

Then, the similarity is simply computed from the vectors of two sentences (usually I normalize the vectors first and then take the dot product of the two vectors; this is known as "cosine similarity").
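To make the projection and the similarity concrete, here is a small sketch using scikit-learn (my choice of library, not something stated in the question); max_features keeps only the top-N most frequent words, and the dot product of L2-normalized rows gives the cosine similarity:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    sentences = [
        "I like red apples more than green apples",
        "green apples are sour",
    ]

    # Keep only the top-N most frequent words as dimensions (N=100000 here).
    vectorizer = CountVectorizer(max_features=100000)
    X = vectorizer.fit_transform(sentences)   # sparse term-frequency matrix

    X_norm = normalize(X)                     # L2-normalize each row
    cosine = (X_norm[0] @ X_norm[1].T).toarray()[0, 0]
    print(cosine)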

Best Answer

One option is to use Latent Dirichlet Allocation (LDA) to model the underlying topics that occur in the sentences. Edwin Chen has a good explanation of LDA here.

Here's one Python package that contains an LDA implementation. It also handles cases that don't fit in memory (though yours should fit once it's represented as a sparse matrix).
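The linked package is not named in the text above; as one hypothetical example, gensim is a Python package with a streaming LDA implementation, and a minimal sketch might look like this (the toy corpus and the num_topics value are placeholders):

    from gensim import corpora, models

    snippets = [
        "I like red apples more than green apples",
        "green apples are sour",
        "the stock market fell sharply today",
    ]
    texts = [s.lower().split() for s in snippets]      # tokenized snippets

    dictionary = corpora.Dictionary(texts)             # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]    # sparse bag-of-words

    # Fit LDA; each snippet then becomes a distribution over num_topics topics,
    # a much lower-dimensional representation to cluster on.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
    topic_vectors = [lda.get_document_topics(bow) for bow in corpus]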

Another option is to apply k-means to the sparse matrix. If runtime is an issue in either case, simply reduce the dimensionality of the feature vectors first.
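As a sketch of that second route (scikit-learn is again my assumption; the answer does not name a library), MiniBatchKMeans works directly on a sparse matrix, and TruncatedSVD is one way to reduce the dimensionality first if runtime becomes a problem:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import MiniBatchKMeans

    snippets = [
        "I like red apples more than green apples",
        "green apples are sour",
        "the stock market fell sharply today",
        "investors sold shares after the news",
    ]

    X = CountVectorizer().fit_transform(snippets)   # sparse frequency matrix

    # Optional: reduce dimensionality before clustering
    # (n_components=2 only for this toy example).
    X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

    # Mini-batch k-means scales to tens of millions of rows far better
    # than a full k-means pass over the data.
    labels = MiniBatchKMeans(n_clusters=2, random_state=0).fit_predict(X_reduced)
    print(labels)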

What is the goal of clustering these sentences?