Solved – Does K-means incorporate the K-nearest neighbour algorithm?

k-nearest-neighbour, k-means

I was watching this tutorial on K-means clustering, and from what I understand, K-means works like this:

  1. Randomly generate the centroids for k clusters
  2. Create a classification model dividing the data into k regions (do we use kNN here?).
  3. Generate new centroids for each region.
  4. Repeat steps 2-3.

In the video, it appears that the lecturer may have used kNN to create the dividing regions, though I can't find anything online about whether kNN is used in K-means.

Best Answer

No, definitely not; k-means and kNN are two completely different things, and k-means doesn't use kNN at all. In step 2 of the k-means algorithm as you have defined it (and, by the way, apart from this minor confusion your summary is basically accurate), the algorithm loops through each of the $n$ data points and computes the distance (usually Euclidean, but it doesn't have to be; it could be Manhattan, Chebyshev, Minkowski, or whatever you like, really) from each point to each of the $k$ centroids. The closest centroid "wins": each point is assigned to its nearest centroid, as computed on the previous iteration, and those assignments are then used to compute the new positions of all the centroids for the next iteration.
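To make step 2 concrete, here is a minimal NumPy sketch of the whole loop. The function name and convergence test are my own choices, and it assumes no cluster ever ends up empty:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: distance from every point to every centroid
        # (Euclidean here, but any metric would do); shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # the closest centroid "wins"
        # Step 3: each centroid moves to the mean of the points it won
        # (this assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 2-3 repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Notice that kNN appears nowhere in this loop: the "classification model" of step 2 is nothing more than an argmin over centroid distances.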

The kNN algorithm answers a somewhat different question. Suppose I give you a so-called "training" data set, which consists of a table with multiple columns. The first column holds a so-called "class value": basically a label, which you might think of as identifying different types of real-world objects that you would like a computer to recognize automatically. All of the other columns specify the values of various "features" associated with each object; for example, its length, mass, aspect ratio, brightness, color, or whatever other quantity you may choose to measure. Now suppose I give you a second, so-called "test" data set, where all of the feature values have been measured in just the same way as in the training data set, BUT the labels have all gone missing! I would like you to infer the missing labels by comparing the features of the test data set to those of the known examples in the training set and looking for patterns of similarity between them.
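As a concrete (entirely made-up) illustration of such tables, here is a toy fruit data set in that spirit, with the class values held in one array and the feature columns (say, length and mass) in another:

```python
import numpy as np

# Hypothetical training set: class values plus two measured features.
train_y = np.array(["apple", "apple", "pear", "pear"])  # class values
train_X = np.array([[7.0, 150.0],   # length, mass
                    [7.5, 160.0],
                    [9.0, 180.0],
                    [9.5, 190.0]])

# Test set: the same features, measured the same way, but no labels.
test_X = np.array([[7.2, 155.0],
                   [9.2, 185.0]])
```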

This second type of problem is known as "classification", or sometimes "supervised machine learning", and there are many different strategies for dealing with it. The kNN algorithm is one of the simplest. For each unknown object in the test data set, it finds the $k$ "nearest" examples in the training set (by whatever distance metric the user specifies), and whichever label is most common among those $k$ nearest training examples is the label assigned to the unknown object.
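Here is a minimal sketch of that procedure, with Euclidean distance as the metric and a function name of my own invention:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_X, k=3):
    """Label each test point by a majority vote among its k nearest training points."""
    preds = []
    for x in test_X:
        # Distance from this test point to every training example.
        dists = np.linalg.norm(train_X - x, axis=1)
        # Indices of the k nearest training examples.
        nearest = np.argsort(dists)[:k]
        # The most common label among them is assigned to the test point.
        preds.append(Counter(train_y[nearest]).most_common(1)[0][0])
    return np.array(preds)
```

On the toy fruit data above, `knn_predict(train_X, train_y, test_X, k=3)` should return `['apple', 'pear']`: each test point inherits the majority label of its three nearest training examples.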

Bottom line, kNN is a completely different thing, and has nothing to do with subdividing a previously undifferentiated data set into empirically defined clusters.