Solved – k-means cluster, How to re-calculate centroid when using cosine similarity

clusteringcosine distancecosine similarityk-means

I have a requirement to use the k-means clustering method with cosine similarity instead of Euclidean distance.

For example:

data a: a1 a2 a3 a4 ...
data b: b1 b2 b3 b4 ...

cosine similarity: $\displaystyle \frac{\mathbf{a}\cdot\mathbf{b}}{ |\mathbf{a}|\cdot|\mathbf{b}|}$

My problem is: how do I re-calculate the centroid vector at each iteration based on cosine similarity?

Can I still use the average, e.g. $\displaystyle \frac{(a_1 + b_1 + c_1)}{3}$ ?

Best Answer

You can normalize the vectors in each cluster to unit length, add them up, and then normalize the sum. The result is a unit vector in the direction of the centroid (a.k.a. prototype) vector. As far as the spherical k-means algorithm is concerned, the length of the centroid vector does not matter and is never used: when the cosine distance between each cluster member and the centroid is computed, both vectors are normalized by their lengths anyway. See the following excerpt from this article:

[Image: centroid computation formula from the cited article]
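The normalize-sum-normalize procedure above can be sketched as follows. This is a minimal illustration using plain Python lists, not code from the cited article; the function names `unit` and `spherical_centroid` are my own.

```python
import math

def unit(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spherical_centroid(vectors):
    # 1) normalize each cluster member, 2) sum componentwise,
    # 3) normalize the sum -> unit vector in the centroid direction
    normed = [unit(v) for v in vectors]
    summed = [sum(col) for col in zip(*normed)]
    return unit(summed)

# two vectors with the same direction but different lengths:
# both contribute equally, and the result is the unit vector [1.0, 0.0]
c = spherical_centroid([[2.0, 0.0], [4.0, 0.0]])
```

Because every member is normalized before summing, a very long vector cannot dominate the centroid direction, which is exactly the behavior spherical k-means wants.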

If you really need a centroid vector with a representative length, you can take the average of the lengths of the cluster members and multiply it by the unit centroid vector. But this is entirely your choice and has nothing to do with the k-means algorithm itself; you could use any other kind of averaging (arithmetic, geometric, or simply the length of the average vector) to compute the representative centroid length.
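As one possible sketch of that optional step, here is a version that uses the arithmetic mean of the member lengths (my choice for illustration; any of the alternatives mentioned above would work the same way):

```python
import math

def unit(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spherical_centroid(vectors):
    # unit vector in the centroid direction (normalize, sum, normalize)
    normed = [unit(v) for v in vectors]
    summed = [sum(col) for col in zip(*normed)]
    return unit(summed)

def centroid_with_length(vectors):
    # arithmetic mean of the member lengths -- one arbitrary choice
    lengths = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    avg_len = sum(lengths) / len(lengths)
    # scale the unit centroid to that representative length
    return [avg_len * x for x in spherical_centroid(vectors)]

# members of lengths 2 and 4 along the x-axis -> centroid [3.0, 0.0]
c = centroid_with_length([[2.0, 0.0], [4.0, 0.0]])
```

The direction still comes from the spherical k-means centroid; only the scaling is the extra, optional choice.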

The formula posted by Vijay Rajan is effectively the same (except giving a centroid vector of non-unit length), but note that in that formula too the vectors must be normalized to unit length before applying the formula. When calculated properly, the centroid does indeed "bisect" the angle between the vectors. (I don't currently have the forum privilege to make this a comment on their response.)
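The bisection claim is easy to check numerically: for two members, the centroid computed from their unit vectors makes equal angles with both. A small demonstration (with vectors of deliberately different lengths, chosen here for illustration):

```python
import math

def unit(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cos_sim(a, b):
    # cosine similarity of two vectors
    a, b = unit(a), unit(b)
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 0.0], [0.0, 5.0]
# normalize first, then sum, then normalize the sum
c = unit([x + y for x, y in zip(unit(a), unit(b))])
# cos_sim(c, a) equals cos_sim(c, b): the centroid bisects the angle
```

If the vectors were summed without normalizing first, the longer vector `b` would pull the centroid toward itself and the two similarities would differ.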