Solved – How to compute term frequency and find clusters in a dataset composed of strings

clustering, information retrieval

I am currently looking into some information retrieval techniques.

I have a SQL database table containing strings. It has 1000 records, each being a random sentence I picked from random web sites. I need to compute the term frequencies and represent each string as a vector. I also need to cluster the records, e.g. using k-means.

Does anyone know the best way to do this? Are there any tools I can use? I am new to this and looking for a jumping-off point.

Best Answer

The state of the art is semantic hashing, by Hinton and Salakhutdinov. If you have a look at the paper, there are some really impressive 2D plots of several benchmark datasets.

It is a rather advanced algorithm, however. You train a stack of restricted Boltzmann machines with contrastive divergence. In the end, each document is represented as a bit vector, which can be used for lookups based on Hamming distance.
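The lookup step is simple once you have the bit vectors: the Hamming distance between two codes is just the popcount of their XOR. A minimal sketch, assuming hypothetical 8-bit document codes (a real semantic-hashing model would typically produce longer codes, e.g. 32 or 128 bits):

```python
# Two hypothetical 8-bit document codes, as a semantic-hashing model might emit.
code_a = 0b10110100
code_b = 0b10010110

# Hamming distance = number of bit positions where the codes differ,
# i.e. the popcount of the XOR of the two codes.
hamming = bin(code_a ^ code_b).count("1")
print(hamming)  # → 2
```

Because the codes are short and comparisons are cheap bit operations, nearest-neighbor lookup over millions of documents becomes feasible.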

A lot of machine learning knowledge is required to implement this successfully, and as far as I know there is no out-of-the-box implementation. If you want to do this and have no prior knowledge of neural networks and the like, it will take quite some effort.
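For the simpler part of the question (term-frequency vectors plus k-means), there are out-of-the-box tools. A minimal sketch using scikit-learn, with a toy list standing in for the 1000 sentences you would pull from the SQL table (the sentences and the choice of k=2 are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-in for the sentences fetched from the database table.
sentences = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply today",
    "the market closed lower on monday",
]

# Build one term-weight vector per sentence (TF-IDF here; use
# CountVectorizer instead for raw term frequencies).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # sparse matrix: rows = sentences

# Cluster the vectors with k-means; random_state makes the run reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)  # one cluster label per sentence
```

This won't give you the retrieval quality of semantic hashing, but it answers the original question directly and is a reasonable baseline to start from.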

Related Question