The question is:
What is the difference between classical k-means and spherical k-means?
Classic K-means:
In classic k-means, we seek to minimize the Euclidean distance between the cluster center and the members of the cluster. The intuition behind this is that the radial distance from the cluster center to each element's location should "have sameness" or "be similar" for all elements of that cluster.
The algorithm is (a rough code sketch follows the list):
- Set the number of clusters (the cluster count)
- Initialize by randomly assigning points in the space to cluster indices
- Repeat until convergence:
  - For each point, find the nearest cluster center and assign the point to that cluster
  - For each cluster, find the mean of its member points and update the center to that mean
- The error is the sum of squared distances from each point to its assigned cluster center
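To make those steps concrete, here is a minimal sketch in Python/numpy. The function name and the choice of initializing centers from random data points are mine, and it assumes no cluster goes empty:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm sketch: X is an (n, d) array, k is the cluster count."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct random data points (one common choice)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of its member points
        # (assumes every cluster keeps at least one member)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```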
Spherical K-means:
In spherical k-means, the idea is to choose the center of each cluster so that the angles between the center and the cluster's members are both small and uniform. The intuition is like looking at stars: the points should have consistent spacing between each other. That spacing is simpler to quantify as "cosine similarity", but it means there are no "milky-way" galaxies forming large bright swathes across the sky of the data. (Yes, I'm trying to speak to grandma in this part of the description.)
More technical version:
Think about vectors: the things you graph as arrows, with an orientation and a fixed length. A vector can be translated anywhere and still be the same vector.
The orientation of a vector in the space (its angle from a reference line) can be computed using linear algebra, in particular the dot product.
If we move all the data so that their tails are at the same point, we can compare the vectors by their angles and group similar ones into a single cluster.
For clarity, the lengths of the vectors are scaled to unit length, so that they are easier to "eyeball" compare.
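In code, that dot-product comparison is short. A toy sketch with made-up vectors (the function name is mine):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle between a and b) = <a, b> / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.8, 0.1])
b = np.array([0.9, 0.9])
print(cosine_similarity(a, b))  # values near 1 mean nearly the same direction
```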
You could think of each cluster as a constellation: the stars in a single cluster are, in some sense, close to each other. These are the constellations I picked out by eyeball.
The value of the general approach is that it lets us contrive vectors for things which otherwise have no geometric dimension, as in the tf-idf method, where the vectors are word frequencies in documents. Two "and" words added together do not equal a "the". Words are discrete and non-numeric; they are non-physical in a geometric sense, but we can represent them geometrically and then use geometric methods to handle them. Spherical k-means can thus be used to cluster documents based on their words.
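As a hedged toy sketch of that contrivance: turn tiny "documents" into term-count vectors, project them onto the unit sphere, and compare by cosine. (A real pipeline would use proper tf-idf weighting, e.g. scikit-learn's TfidfVectorizer; the vocabulary and counts below are invented.)

```python
import numpy as np

# Toy "documents" as term-count vectors over a tiny vocabulary
vocab = ["and", "the", "star", "cluster"]
docs = [
    [3, 1, 0, 0],  # mostly stop words
    [0, 0, 4, 2],  # astronomy-flavored
    [0, 1, 2, 1],  # astronomy-flavored, but a shorter document
]
X = np.array(docs, dtype=float)

# Project onto the unit sphere so document length stops mattering
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Cosine similarity is now just a dot product between unit vectors;
# the two astronomy-flavored rows come out similar despite different lengths
print(X_unit @ X_unit.T)
```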
So the (2d random, continuous) data was this:
$$
\begin{bmatrix}
x_1 & y_1 & x_2 & y_2 & \text{group}\\
0&-0.8&-0.2013&-0.7316&B\\
-0.8&0.1&-0.9524&0.3639&A\\
0.2&0.3&0.2061&-0.1434&C\\
0.8&0.1&0.4787&0.153&B\\
-0.7&0.2&-0.7276&0.3825&A\\
0.9&0.9&0.748&0.6793&C\\
\end{bmatrix}
$$
Some points:
- The points are projected onto the unit sphere to account for differences in document length.
Let's work through an actual process and see how (bad) my "eyeballing" was.
The procedure is:
- (implicit in the problem) place the vectors' tails at the origin
- project onto the unit sphere (to account for differences in document length)
- use clustering to minimize the "cosine dissimilarity"
$$ J = \sum_{i} d \left( x_{i},p_{c\left( i \right)} \right) $$
where
$$ d \left( x,p \right) = 1 - \cos \left(x,p\right) =
1 - \frac{\langle x,p \rangle}{\left \|x \right \| \left \|p \right \|} $$
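Putting the projection, assignment, and objective together, here is a minimal sketch of spherical k-means in Python/numpy. The function name is mine; it assumes every cluster stays non-empty and that no cluster mean collapses to the zero vector:

```python
import numpy as np

def spherical_kmeans(X, k, iters=100, seed=0):
    """Sketch: minimize J = sum_i d(x_i, p_c(i)) with d(x, p) = 1 - cos(x, p)."""
    # Project the data onto the unit sphere (step 2 of the procedure)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # For unit vectors, maximizing <x, p> is the same as minimizing 1 - cos(x, p)
        labels = (X @ centers.T).argmax(axis=1)
        # Update: mean of the member vectors, renormalized back onto the unit sphere
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        new_centers /= np.linalg.norm(new_centers, axis=1, keepdims=True)
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers
```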
Links:
- http://epub.wu.ac.at/4000/1/paper.pdf
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.111.8125&rep=rep1&type=pdf
- http://www.cs.gsu.edu/~wkim/index_files/papers/refinehd.pdf
- https://www.jstatsoft.org/article/view/v050i10
- http://www.mathworks.com/matlabcentral/fileexchange/32987-the-spherical-k-means-algorithm
- https://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machine-learning-and-statistics-spring-2012/projects/MIT15_097S12_proj1.pdf
Best Answer
As far as I know, the "mean" of a cluster and the centroid of a single cluster are the same thing, though the term "centroid" might be a little more precise than "mean" when dealing with multivariate data.
To find the centroid, one computes the (arithmetic) mean of the points' positions separately for each dimension. For example, if you had points at (-1, 10, 3), (0, 5, 2), and (1, 20, 10),
then the centroid would be located at ((-1+0+1)/3, (10+5+20)/3, (3+2+10)/3), which simplifies to (0, 11 2/3, 5). (NB: The centroid does not have to be, and rarely is, one of the original data points.)
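In numpy, that per-dimension mean is one line (using the same three points):

```python
import numpy as np

points = np.array([[-1, 10, 3],
                   [ 0,  5, 2],
                   [ 1, 20, 10]])
centroid = points.mean(axis=0)  # arithmetic mean of each dimension separately
print(centroid)  # roughly [0, 11.67, 5]
```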
The centroid is also sometimes called the center of mass or barycenter, based on its physical interpretation (it's the center of mass of an object defined by the points). Like the mean, the centroid's location minimizes the sum of squared distances to the points.
A related idea is the medoid, which is the data point that is "least dissimilar" from all of the other data points. Unlike the centroid, the medoid has to be one of the original points. You may also be interested in the geometric median, which is analogous to the median, but for multivariate data. These are both different from the centroid.
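To make the medoid concrete, here is a brute-force sketch over the same toy points (the medoid must be one of them):

```python
import numpy as np

points = np.array([[-1, 10, 3], [0, 5, 2], [1, 20, 10]])
# Pairwise Euclidean distances between all points
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
# The medoid is the point with the smallest total distance to the others
medoid = points[dists.sum(axis=1).argmin()]
print(medoid)  # [-1 10  3]
```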
However, as Gabe points out in his answer, there is a difference between the "centroid distance" and the "average distance" when you're comparing clusters. The centroid distance between cluster $A$ and $B$ is simply the distance between $\text{centroid}(A)$ and $\text{centroid}(B)$. The average distance is calculated by finding the average pairwise distance between the points in each cluster. In other words, for every point $a_i$ in cluster $A$, you calculate $\text{dist}(a_i, b_1)$, $\text{dist}(a_i, b_2)$ , ... $\text{dist}(a_i, b_n)$ and average them all together.
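A tiny worked contrast between the two (the cluster values are mine, for illustration):

```python
import numpy as np

A = np.array([[0.0, 0.0], [2.0, 0.0]])  # cluster A
B = np.array([[0.0, 3.0], [2.0, 3.0]])  # cluster B

# Centroid distance: distance between the two cluster means
centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # 3.0

# Average distance: mean over all pairwise point-to-point distances
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
avg_dist = pairwise.mean()  # (3 + sqrt(13) + sqrt(13) + 3) / 4, about 3.30
```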