[Math] Clustering algorithm to cluster objects based on their relation weight

algorithms, clustering

I have $n$ words and a relatedness weight for each pair of them, which gives me an $n\times n$ matrix. I'm going to use this for a search algorithm, but the problem is that I need to cluster the entered keywords based on their pairwise relations. So let's say the keywords are {tennis, federer, wimbledon, london, police} and we have the following data from our weight matrix:

            tennis  federer  wimbledon  london  police
tennis       1        0.8      0.6       0.4     0.0
federer      0.8      1        0.65      0.4     0.02
wimbledon    0.6      0.65     1         0.2     0.09
london       0.4      0.4      0.2       1       0.71
police       0.0      0.02     0.09      0.71    1

I need an algorithm to cluster them into two clusters: {tennis, federer, wimbledon} and {london, police}. Is there any known clustering algorithm that can deal with this? I did some research; it appears that k-means is the best-known clustering algorithm, but apparently k-means doesn't suit this case.
I would greatly appreciate any help.

Best Answer

You should use principal components analysis. Here's a quick demo in Matlab. You enter your weights matrix:

weights =

1.0000    0.8000    0.6000    0.4000         0
0.8000    1.0000    0.6500    0.4000    0.0200
0.6000    0.6500    1.0000    0.2000    0.0900
0.4000    0.4000    0.2000    1.0000    0.7100
     0    0.0200    0.0900    0.7100    1.0000
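For reference, here is one way to enter that matrix yourself (the values are copied straight from the table in the question):

% Define the symmetric 5x5 weights matrix from the question
weights = [1.00 0.80 0.60 0.40 0.00;
           0.80 1.00 0.65 0.40 0.02;
           0.60 0.65 1.00 0.20 0.09;
           0.40 0.40 0.20 1.00 0.71;
           0.00 0.02 0.09 0.71 1.00];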

and take its singular value decomposition with the command svd:

>> [u, s] = svd(weights);

Now the columns of u contain information on how your words are related to each other, and s contains information on how much of the information in the underlying table is captured by each column. Since all of your weights are nonnegative, the first column simply says that all of the words are related (e.g. if your weights are correlations in search traffic, you can think of this as saying that each term increases when the overall amount of search traffic increases):

>> u(:,1)

ans =

0.5307
0.5419
0.4668
0.4054
0.2064

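(The original answer showed a plot of these entries here. A minimal sketch to reproduce something similar, assuming a simple bar chart labelled with the keywords, would be:)

% Sketch: bar chart of the first left singular vector, one bar per keyword
words = {'tennis', 'federer', 'wimbledon', 'london', 'police'};
bar(u(:,1));
set(gca, 'XTickLabel', words);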

The second column of u contains information on how the search terms cluster together:

>> u(:,2)

ans =

 0.2396
 0.2380
 0.2475
-0.5478
-0.7243


This is telling you that the first three terms ("tennis", "Federer" and "Wimbledon") are related to each other, as are the last two ("London", "police"): the first three entries share one sign and the last two share the other, and entries with the same sign belong to the same cluster.
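One simple way to turn that observation into the two requested clusters in code (a sketch, not part of the original answer; the keyword cell array is introduced here just for illustration) is to split on the sign of the second column:

% Sketch: split the keywords by the sign of u(:,2)
% Note: the overall sign of a singular vector is arbitrary, so which group
% comes out "positive" may flip between SVD implementations.
words = {'tennis', 'federer', 'wimbledon', 'london', 'police'};
cluster1 = words(find(u(:,2) > 0))   % {'tennis', 'federer', 'wimbledon'}
cluster2 = words(find(u(:,2) < 0))   % {'london', 'police'}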

The diagonal entries of the matrix s tell you how much of the information is carried by each column. More specifically, if we take their cumulative sum and normalize by the total, we get a vector whose $n$th entry tells us how much information is retained if we keep only the first $n$ columns:

>> cumsum(diag(s))/sum(diag(s))

ans =

0.5300
0.8300
0.9315
0.9708
1.0000

This tells us that we retain 83% of the relevant information in the weights matrix if we only look at the first two columns.
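If you wanted to pick the number of columns to keep automatically, a small sketch along the same lines (the 80% cutoff is an arbitrary threshold assumed here, not something from the answer):

% Smallest number of columns that retains at least 80% of the information
explained = cumsum(diag(s)) / sum(diag(s));
k = find(explained >= 0.8, 1)   % here k = 2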
