This should probably go as a comment but I cannot add it there because of low reputation. However, I think it can also partly serve as an answer to the OP's question.
As @NickCox alluded to, you should tackle this piecemeal. Read through the theory, work with a small number of variables, and try to develop an intuitive sense of what is happening. Developing that intuition will take some time, but it will pay off when you are trying to interpret the results of your principal component analysis.
If you follow the link @NickCox suggested and scroll down to whuber's answer, you will find a good geometric explanation of what eigenvectors and eigenvalues mean, which will give you a visual feel for what you are looking at. To add to that answer: if you collapse his ellipsoid into a two-dimensional space, i.e., look at only the first two principal components, you get an ellipse with one major and one minor axis. The major axis is the "direction" that explains most of the variance in your data. It is represented by the first eigenvector, and its length is given by the first eigenvalue (when you arrange the eigenvalues in descending order), which is the variance along that direction. It is also a linear combination of all the variables in your dataset; more specifically, it is the linear combination that explains the most variance in your data.
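To make that geometric picture concrete, here is a minimal sketch of PCA as an eigen-decomposition of the covariance matrix. The toy data and variable names are my own, purely for illustration, not from the linked answer:

    import numpy as np

    # Toy data: 200 observations of 2 correlated variables (made-up numbers)
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=200)

    # PCA via the eigen-decomposition of the covariance matrix
    cov = np.cov(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort in descending order: the first eigenvector is the major axis of the
    # ellipse, and its eigenvalue is the variance along that direction
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    print("Eigenvalues (variance along each principal axis):", eigenvalues)
    print("First eigenvector (the 'direction' of the major axis):", eigenvectors[:, 0])
    print("Proportion of variance explained by PC1:", eigenvalues[0] / eigenvalues.sum())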
Now, going back to your data: look into textbooks on multivariate methods for ways of analyzing the covariance structure of your dataset. I am afraid I cannot point to anything specific, but I think any intermediate-level book would do. Additionally, I remember reading about applied principal component analysis in Statistical Methods in the Atmospheric Sciences by Daniel Wilks. If you get hold of this book, you will find a good overview of the methodology in Chapter 11 (atmospheric scientists refer to PCA as Empirical Orthogonal Function (EOF) analysis, but it is the same thing). The author works with multivariate data, so it is structurally similar to what you have.
Lastly, in addition to clustering and PCA, you can also look into Canonical Correlation Analysis (CCA). It is also used to reduce the dimensionality of a dataset, but it looks at the relationship between two sets of variables. I cannot say whether it would be applicable in your case.
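If you want to experiment with CCA, scikit-learn ships an implementation. A minimal sketch with made-up data, where X and Y stand for the two sets of variables whose relationship CCA examines:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Two made-up blocks of variables measured on the same 100 samples
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))                          # first set of variables
    Y = X[:, :2] + rng.normal(scale=0.5, size=(100, 2))    # second set, correlated with X

    cca = CCA(n_components=2)
    X_c, Y_c = cca.fit_transform(X, Y)

    # Correlation between each pair of canonical variates
    for i in range(2):
        print("Canonical correlation", i + 1, ":", np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1])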
I was pondering the same question, and I solved it like this. Suppose you have a co-occurrence matrix/contingency table where the rows are the ground-truth clusters and the columns are the clusters found by the clustering algorithm.
So, for the example in the book, it would look like:
  | 1 | 2 | 3
--+---+---+---
x | 5 | 1 | 2
--+---+---+---
o | 1 | 4 | 0
--+---+---+---
◊ | 0 | 1 | 3
Now you can easily compute TP + FP by taking the sum of each column and applying 'choose 2' to each of those sums. Here the column sums are [6, 6, 5], so you compute '6 choose 2' + '6 choose 2' + '5 choose 2'.
Similarly, you can get TP + FN by taking the sums over the rows ([8, 5, 4] in the example above), applying 'choose 2' to each of those values, and summing the results.
The TPs themselves can be calculated by applying 'choose 2' to every cell in the matrix and summing everything (taking '1 choose 2' to be 0).
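Worked out for the table above (so you can check the numbers the script prints below): TP + FP = (6 choose 2) + (6 choose 2) + (5 choose 2) = 15 + 15 + 10 = 40; TP + FN = (8 choose 2) + (5 choose 2) + (4 choose 2) = 28 + 10 + 6 = 44; TP = (5 choose 2) + (2 choose 2) + (4 choose 2) + (3 choose 2) = 10 + 1 + 6 + 3 = 20 (all other cells contribute 0). Hence FP = 40 - 20 = 20, FN = 44 - 20 = 24, and with 17 points in total, TN = (17 choose 2) - TP - FP - FN = 136 - 20 - 20 - 24 = 72.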
In fact, here is some Python code that does exactly that:
import numpy as np
from scipy.special import comb  # scipy.misc.comb was removed in newer SciPy versions

# comb(n, k) computes 'n choose k', but with exact=True it cannot be applied
# to an array right away, so we vectorize it here.
def myComb(a, b):
    return comb(a, b, exact=True)

vComb = np.vectorize(myComb)

def get_tp_fp_tn_fn(cooccurrence_matrix):
    # Column sums -> pairs placed in the same found cluster (TP + FP)
    tp_plus_fp = vComb(cooccurrence_matrix.sum(0, dtype=int), 2).sum()
    # Row sums -> pairs belonging to the same ground-truth cluster (TP + FN)
    tp_plus_fn = vComb(cooccurrence_matrix.sum(1, dtype=int), 2).sum()
    # 'choose 2' over every cell -> true positives
    tp = vComb(cooccurrence_matrix.astype(int), 2).sum()
    fp = tp_plus_fp - tp
    fn = tp_plus_fn - tp
    tn = comb(cooccurrence_matrix.sum(), 2, exact=True) - tp - fp - fn
    return [tp, fp, tn, fn]

if __name__ == "__main__":
    # The co-occurrence matrix from the example in
    # Introduction to Information Retrieval (Manning, Raghavan & Schutze, 2009),
    # also available at:
    # http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
    cooccurrence_matrix = np.array([[5, 1, 2], [1, 4, 0], [0, 1, 3]])

    # Get the stats
    tp, fp, tn, fn = get_tp_fp_tn_fn(cooccurrence_matrix)
    print("TP: %d, FP: %d, TN: %d, FN: %d" % (tp, fp, tn, fn))

    # Print the measures:
    print("Rand index: %f" % (float(tp + tn) / (tp + fp + fn + tn)))
    precision = float(tp) / (tp + fp)
    recall = float(tp) / (tp + fn)
    print("Precision : %f" % precision)
    print("Recall    : %f" % recall)
    print("F1        : %f" % ((2.0 * precision * recall) / (precision + recall)))
If I run it I get:
$ python testCode.py
TP: 20, FP: 20, TN: 72, FN: 24
Rand index: 0.676471
Precision : 0.500000
Recall : 0.454545
F1 : 0.476190
I actually didn't check any other examples than this one, so I hope I did it right.... ;-)
The state of the art is to use semantic hashing by Hinton and Salakhutdinov. If you have a look at the paper, there are some really impressive 2D plots of several benchmark datasets.
It is a rather advanced algorithm, however. You train a stack of restricted Boltzmann machines with contrastive divergence. In the end, the representation of a document is a bit vector, which can be used for lookups based on the Hamming distance.
A lot of machine learning knowledge is required to successfully implement this, and as far as I know there is no out-of-the-box implementation. If you want to do this and have no prior knowledge of neural networks et al., it will take quite some effort.
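The Hamming-distance lookup on the resulting bit vectors is the easy part, though. A minimal sketch with random stand-in codes (in practice they would come out of the trained RBM stack):

    import numpy as np

    # Stand-in document codes: in practice these come from the trained RBM stack
    rng = np.random.default_rng(0)
    codes = rng.integers(0, 2, size=(1000, 32), dtype=np.uint8)  # 1000 docs, 32-bit codes
    query = codes[0]

    # Hamming distance = number of differing bits
    distances = np.count_nonzero(codes != query, axis=1)
    nearest = np.argsort(distances)[:5]
    print("Closest documents to the query:", nearest, "with distances", distances[nearest])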