I have $n$ words and their relatedness weights, which give me an $n\times n$ matrix. I'm going to use this for a search algorithm, but the problem is that I need to cluster the entered keywords based on their pairwise relations. Say the keywords are {tennis, federer, wimbledon, london, police} and we have the following data from our weight matrix:
            tennis  federer  wimbledon  london  police
tennis      1       0.8      0.6        0.4     0.0
federer     0.8     1        0.65       0.4     0.02
wimbledon   0.6     0.65     1          0.2     0.09
london      0.4     0.4      0.2        1       0.71
police      0.0     0.02     0.09       0.71    1
I need an algorithm to cluster them into two clusters: {tennis, federer, wimbledon} and {london, police}. Is there any known clustering algorithm that can deal with such a thing? I did some research; it appears that k-means is the most widely used clustering algorithm, but apparently k-means doesn't suit this case, since it expects points in a coordinate space rather than pairwise similarities.
I would greatly appreciate any help.
Best Answer
You should use principal component analysis. Here's a quick demo in Matlab. You enter your weights matrix and take its singular value decomposition with the command `svd`:
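The original Matlab listing didn't survive the formatting, so as a stand-in, here's the same step sketched in Python with NumPy (`numpy.linalg.svd`); the names `u`, `s`, `v` mirror Matlab's, though NumPy returns the singular values as a vector rather than a diagonal matrix:

```python
import numpy as np

# Weights matrix from the question; rows/columns are ordered
# tennis, federer, wimbledon, london, police.
w = np.array([
    [1.00, 0.80, 0.60, 0.40, 0.00],
    [0.80, 1.00, 0.65, 0.40, 0.02],
    [0.60, 0.65, 1.00, 0.20, 0.09],
    [0.40, 0.40, 0.20, 1.00, 0.71],
    [0.00, 0.02, 0.09, 0.71, 1.00],
])

# Singular value decomposition: w == u @ np.diag(s) @ v.
# NumPy returns s as a vector of singular values, largest first.
u, s, v = np.linalg.svd(w)
print(u)
print(s)
```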
Now the columns of `u` contain information on how your words are related to each other, and `s` contains information on how much of the information in the underlying table is captured by each column. Since all of your weights are greater than zero, the first column of `u` simply says that all of the words are related (e.g., if your weights are correlations in search traffic, you can think of this as saying that each term increases when the overall amount of search traffic increases). The second column of `u` tells you how the search terms cluster together: the first three terms ("tennis", "Federer" and "Wimbledon") are related to each other, as are the last two ("London" and "police").
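To make that clustering step concrete (again a NumPy sketch, not the original Matlab): SVD only fixes each column's sign up to a global flip, so what matters is that the first three entries of the second column share one sign and the last two share the other. Grouping the words by that sign recovers the two clusters:

```python
import numpy as np

words = ["tennis", "federer", "wimbledon", "london", "police"]
w = np.array([
    [1.00, 0.80, 0.60, 0.40, 0.00],
    [0.80, 1.00, 0.65, 0.40, 0.02],
    [0.60, 0.65, 1.00, 0.20, 0.09],
    [0.40, 0.40, 0.20, 1.00, 0.71],
    [0.00, 0.02, 0.09, 0.71, 1.00],
])
u, _, _ = np.linalg.svd(w)

# Split words by whether their entry in the second column of u has the
# same sign as the first word's entry (overall sign is arbitrary).
same = u[:, 1] * u[0, 1] > 0
cluster_a = [wd for wd, m in zip(words, same) if m]
cluster_b = [wd for wd, m in zip(words, same) if not m]
print(cluster_a, cluster_b)
```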
The diagonal entries of the matrix `s` tell you what portion of the information is carried by each column. More specifically, if we take their cumulative sum and normalize it by the total, we get a vector whose $n$'th entry tells us how much information is retained if we only look at the first $n$ columns. This tells us that we retain about 83% of the relevant information in the weights matrix if we only look at the first two columns.
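That normalization, sketched once more in NumPy under the same assumptions (in Matlab `s` is a diagonal matrix, so you would take `diag(s)` first; NumPy already returns the singular values as a vector):

```python
import numpy as np

w = np.array([
    [1.00, 0.80, 0.60, 0.40, 0.00],
    [0.80, 1.00, 0.65, 0.40, 0.02],
    [0.60, 0.65, 1.00, 0.20, 0.09],
    [0.40, 0.40, 0.20, 1.00, 0.71],
    [0.00, 0.02, 0.09, 0.71, 1.00],
])
_, s, _ = np.linalg.svd(w)

# n'th entry: fraction of the total information retained if we keep
# only the first n columns.
retained = np.cumsum(s) / np.sum(s)
print(retained)
```

The second entry of `retained` should land near the 83% figure quoted above, and the last entry is exactly 1 by construction.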