Solved – Word clustering using different algorithms

clustering

At the moment I'm researching clustering of single words. The input of this research is a list of words (unigrams). During the research I want to compare different clustering algorithms to see the how their advantages and disadvantages affect the outcomes.

At the moment I have identified two steps in the research and that are disambiguation and clustering. During the first step (disambiguation) the true sense of the words will be determined. I propose that there will be a latent context available between the words that can help with finding the true sense of the words. This has been tested using word sense disambiguation algorithms and works. With this algorithm (or other if needed) I can calculate word similarity or distance. Thus, the output of the first step can be a similarity matrix. This similarity matrix will be the input of the cluster step. Hierarchical clustering algorithms pose no problem, algorithms like agglomerative clustering can easily be applied on based on the matrix, but I also want to see how non-hierarchical cluster algorithms like: K-means or the soft version Expectation-Maximization (EM) algorithm.

Although I cannot find how an similarity or distance matrix can be used as dataset in a k-means algorithm. When looking at these algorithms (e.g. http://phpir.com/clustering) I see that you need at least a X and Y, also to plot them on a two dimensional plain, but I don't know where to find these variables in my matrix. I tried to transform the matrix using multidimensional scaling, although the results were inconclusive. Some clusters where visible, but I have the feeling that I'm missing something.

My question is: how can I apply similarity- or distance matrix to algorithms like k-means, EM or or even other algorithms.

Best Answer

Clustering algorithms like KMeans take a matrix where the rows are the observation and the columns are features for that observation. If you have a set of words (w1,w2,...,wn) then you could set up your similarity matrix where the rows represent a specific word and each column is the similarity to each other word. For Example:

      w1 |  w2  ... wn
--------------------
w1 | 1.0 | .12  |  .03
--------------------
w2 | .12 | 1.0  |  .32
--------------------
...|     |      |
--------------------
w3 | .02 | .32  |  1.0

Where any row,column value would represent the the similarity between the word along the row with the word along the column.

This would be your X matrix which you could feed to a clustering algorithm like KMeans (scikit-learn page)