In k-means clustering we initially pick $k$ random centroids and assign each data point to the nearest of these $k$ centroids. After this we create new centroids by taking the mean of the points assigned to each one.
However, it may happen that one of the initially selected random centroids is not the nearest centroid to any point in the dataset, so no points are assigned to it. In that case, what should be done in the step that creates the new centroids?
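To make the situation concrete, here is a minimal sketch of the assignment step in plain Python. The function name `assign` and the toy data are made up for illustration; the third starting centroid is deliberately placed far from every point, so its cluster comes out empty and its mean is undefined.

```python
def assign(points, centroids):
    """Return, for each centroid, the list of points nearest to it."""
    clusters = [[] for _ in centroids]
    for p in points:
        # Squared Euclidean distance from p to every centroid.
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        clusters[d.index(min(d))].append(p)
    return clusters


# Hypothetical toy data: two tight groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
# The third initial centroid is far away from all of the data.
centroids = [(0.0, 0.0), (10.0, 10.0), (100.0, 100.0)]

clusters = assign(points, centroids)
sizes = [len(c) for c in clusters]
print(sizes)  # [3, 2, 0] -> the third cluster has no points to average
```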
Best Answer
I am not sure there is a "standard" thing to do when one of the initial centroids is completely off.
You can easily test this by specifying the initial centroids and see how things evolve!
For instance, R will just give you an error.
Say you do:
Now, R obviously has no issue discriminating the 3 clusters when you let it choose the initial centroids, but when you run it the second time it will just say:
I guess that if you are implementing your own algorithm you may choose to replicate this behaviour (stop with an error), or instead give the user a warning and let the algorithm choose replacement centroids by itself.
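If you do write your own implementation, the two options could look roughly like the sketch below. This is only an illustration in plain Python, not how R's kmeans works internally: the function `kmeans`, its `on_empty` parameter, and the "reseed" policy (replace an empty cluster's centroid with a randomly chosen data point) are all assumptions made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(coord) / n for coord in zip(*pts))


def kmeans(points, centroids, on_empty="error", max_iter=100, seed=0):
    """Lloyd-style k-means with an explicit policy for empty clusters.

    on_empty="error"  -> raise, similar in spirit to what R does
    on_empty="reseed" -> assumed policy: move the centroid to a random point
    """
    rng = random.Random(seed)
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sq_dist(p, c) for c in centroids]
            clusters[d.index(min(d))].append(p)

        # Update step: mean of the members, or the empty-cluster policy.
        new_centroids = []
        for members in clusters:
            if members:
                new_centroids.append(mean(members))
            elif on_empty == "error":
                raise ValueError("empty cluster: try different initial centroids")
            else:
                new_centroids.append(tuple(rng.choice(points)))

        if new_centroids == centroids:
            break  # converged: clusters match the returned centroids
        centroids = new_centroids
    return centroids, clusters
```

With `on_empty="error"` a far-off starting centroid stops the run immediately; with `on_empty="reseed"` the run continues from a new centroid that is guaranteed to sit on a data point.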
Obviously, as others pointed out, there are algorithms such as k-means++ that help in choosing a good set of starting centroids.
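For reference, the seeding idea behind k-means++ can be sketched in a few lines of plain Python. The function name `kmeanspp_init` is made up for this example, and the sketch assumes the data contain at least $k$ distinct points. Because every seed is itself a data point, each starting centroid has at least one point (itself) at distance zero, so no cluster is empty after the first assignment.

```python
import random


def kmeanspp_init(points, k, seed=0):
    """Pick k starting centroids from the data, k-means++ style.

    Assumes `points` contains at least k distinct points.
    """
    rng = random.Random(seed)
    # First centroid: uniformly at random among the data points.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
            for p in points
        ]
        # Next centroid: sampled with probability proportional to d2, so
        # already-chosen points (weight 0) are never picked again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids
```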
Also, in R you can use the nstart parameter of the kmeans function to run the algorithm several times from different random sets of starting centroids and keep the best result: this will improve clustering in certain situations.
EDIT: also, note from the R kmeans help page