If you want to visualize KNN classification, there's a good example here taken from the book An Introduction to Statistical Learning, which can be downloaded freely from their webpage.
They also have several neat examples for KNN regression, but I have not found the code for those.
More to the point, the package you mentioned, kknn, has built-in plot() methods for many of its functions, and you should browse the vignette, which contains several examples.
Cover's Theorem: Roughly stated, it says that given a random finite set of points (with arbitrary labels), with high probability these points can be made linearly separable [1] by mapping them to a higher dimension [2].
Implication: Great, so what this theorem tells me is that if I take my dataset and map the points to a higher dimension, then I can easily find a linear classifier. However, most classifiers need to compute some kind of similarity, like a dot product, which means that the time complexity of a classification algorithm is proportional to the dimension of the data points. So higher dimension means larger time complexity (not to mention the space complexity needed to store those high-dimensional points).
Kernel Trick: Let $n$ be the original dimension of the data points and $f$ be the map which sends these points to a space of dimension $N \gg n$. Now, if there is a function $K$ which takes inputs $x$ and $y$ from the original space and computes $K(x, y) = \langle f(x), f(y) \rangle$, then I am able to compute the dot product in the higher-dimensional space, but in complexity $O(n)$ instead of $O(N)$.
Implication: So, if the classification algorithm depends only on the dot product and has no dependency on the actual map $f$, I can use the kernel trick to run the algorithm in the high-dimensional space at almost no additional cost.
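To make the saving concrete, here is a minimal sketch (in Python with numpy; the function names are my own, not from any particular library) for the degree-two polynomial kernel $K(x, y) = \langle x, y \rangle^2$, whose explicit feature map has dimension $N = n^2$:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, y) = (x . y)^2: all products x_i * x_j.
    # The mapped vector lives in dimension N = n^2, so building it and
    # taking dot products there costs O(N).
    return np.outer(x, x).ravel()

def poly_kernel(x, y):
    # The same quantity computed directly in the original space: O(n).
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

print(np.dot(phi(x), phi(y)))  # dot product in the N-dimensional space
print(poly_kernel(x, y))       # identical value at O(n) cost
```

Both lines print the same number; only the cost of obtaining it differs.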
Does linear separability imply that points from the same class will get closer to each other than to points from different classes?
No, there is no such guarantee. Linear separability does not imply that points from the same class have gotten closer, or that points from two different classes have gotten any farther apart.
So why would kNN work?
It need not! However, if it does, then it is purely because of the kernel.
What does that mean?
Consider the boolean feature vector $x = (x_1, x_2)$. When you use a degree-two polynomial kernel, the feature vector $x$ is implicitly mapped to the vector $(x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. From a vector of boolean features, just by using a degree-two polynomial, we have obtained a feature vector of "conjunctions". Thus, the kernels themselves produce some brilliant feature maps. If your data has good original features and could benefit from the feature maps created by these kernels, then kNN stands to benefit from using kernels. By benefit, I mean that the features produced by these feature maps bring points from the same class closer to each other and push points from different classes farther apart. Otherwise, the results won't be any different from what you get by running kNN on the original data.
Then why use kernel kNN?
We showed that the computational complexity of using kernels is only slightly higher than that of the usual kNN, so if the data benefits from using kernels, why not use them anyway?
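For illustration, here is a minimal sketch of kernelized kNN (Python with numpy; kernel_knn_predict is a hypothetical name, not a library function). It relies only on the identity $\|f(x) - f(z)\|^2 = K(x, x) - 2K(x, z) + K(z, z)$, so the map $f$ is never materialized:

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    # K(x, y) = (x . y)^degree: the dot product in the implicit feature space.
    return np.dot(x, y) ** degree

def kernel_knn_predict(X_train, y_train, x_test, k=3):
    # Squared distance in the mapped space, without ever computing f:
    # ||f(x) - f(z)||^2 = K(x, x) - 2 K(x, z) + K(z, z).
    d2 = np.array([poly_kernel(x_test, x_test)
                   - 2.0 * poly_kernel(x_test, z)
                   + poly_kernel(z, z)
                   for z in X_train])
    nearest = np.argsort(d2)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]  # majority vote among the k neighbors

# Toy usage: two well-separated classes in two dimensions.
X = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0]])
y = np.array([0, 0, 1, 1])
print(kernel_knn_predict(X, y, np.array([0.9, 1.0])))  # prints 1
```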
Is there any paper that has studied which classes of data can benefit from kernels in kNN?
As far as I know, no.
[1] http://en.wikipedia.org/wiki/Linear_separability
[2] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4038449&tag=1
Best Answer
No, definitely not; kmeans and kNN are two completely different things, and kmeans doesn't use kNN at all. In step 2 of the kmeans algorithm as you have defined it (and BTW, except for this minor confusion, your summary is otherwise basically accurate), the kmeans function loops through each of the $n$ data points and tests the distance (usually Euclidean, but it doesn't have to be; it could be Manhattan distance, or Chebyshev, or Minkowski, or whatever you like, really) of each point from each of the $m$ centroids computed on the previous iteration. The closest centroid "wins", each point is grouped with its winning centroid, and those groups are used to compute the new values of all the centroids for the next iteration.
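As a sketch of the loop just described (Python with numpy; squared Euclidean distance assumed, and the initialization is the simplest possible, not what a production implementation would do):

```python
import numpy as np

def kmeans(X, m, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Naive initialization: m data points picked at random as centroids.
    centroids = X[rng.choice(len(X), size=m, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 2: distance of each of the n points to each of the m
        # centroids (squared Euclidean here, but any metric works).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        closest = d2.argmin(axis=1)  # the closest centroid "wins"
        # Each centroid moves to the mean of the points it won, giving
        # the new centroid values for the next iteration.
        for j in range(m):
            if np.any(closest == j):
                centroids[j] = X[closest == j].mean(axis=0)
    return centroids, closest
```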
The kNN algorithm answers a somewhat different question. Suppose I give you a so-called "training" data set which consists of a table with multiple columns. In the first column is a so-called "class value", basically a label which you might think of as identifying different types of real-world objects that you would like a computer to be able to automatically recognize. All of the other columns in the table specify the values of various "features" associated with each object; for example, its length, mass, aspect ratio, brightness, color, or whatever other quantity you may choose to measure. Now, suppose that I give you a second data set, a so-called "test" data set, where all of the values in the data columns have been measured in just the same way as in the training data set, BUT the labels have all gone missing! I would like you to infer the labels which are missing from the test data set by comparing the features of the test data set to those of known examples in the training set, and looking for patterns of similarities between them.
This second type of problem which I have just described is known as "classification", or sometimes "supervised machine learning", and there are many different strategies for dealing with it. The kNN algorithm is one of the simplest of those strategies. Basically, the kNN algorithm considers each unknown object in the test data set, and it finds the $k$ "nearest" (by whatever distance metric the user specifies) examples in the training set. Then, whichever label is most common among those $k$ nearest examples in the training set is the label assigned to the unknown object in the test set.
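In code, the whole strategy fits in a few lines. A minimal sketch (Python with numpy; Euclidean distance assumed, and knn_predict is my own name for it):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    preds = []
    for x in X_test:
        # Distance from the unknown object to every training example
        # (Euclidean here; swap in any metric the user specifies).
        d2 = ((X_train - x) ** 2).sum(axis=1)
        nearest = np.argsort(d2)[:k]
        # The most common label among the k nearest training examples
        # becomes the label of the unknown object.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```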
Bottom line, kNN is a completely different thing, and has nothing to do with subdividing a previously undifferentiated data set into empirically defined clusters.