I got this question in a quiz: what will the training error be for a KNN classifier when K=1? What does training mean for a KNN classifier? My understanding was that a KNN classifier considers the entire data set and assigns any new observation the majority value of its K closest neighbours. Where does training come into the picture? Also, the correct answer provided was that the training error will be zero irrespective of the data set. How is this possible?
Solved – Training error in KNN classifier when K=1
classification, k nearest neighbour, supervised learning
Related Solutions
The answer to this question is rather long; I'll try to keep it short.
The major differences between KNN and SVMs are:
- SVMs need a proper training phase, whereas KNN classifies points directly using a given distance metric
- in SVMs the optimum is guaranteed, because the training phase solves an optimization problem: SVMs aim to separate the classes with the optimal hyperplane. KNN, by contrast, is only quasi-optimal.
- unlike KNN (which works for any number of classes), standard SVMs are binary classifiers (the hyperplane separates two classes). To build a multiclass SVM you have to use either the One-vs-One approach or the One-vs-All approach. Many toolboxes, such as LibSVM, implement multiclass SVMs as well. For the record:
- in the One-vs-All approach you train as many SVMs as there are classes (the ith SVM sees the ith class as positive and all the others as negative); you then feed the unknown pattern to the whole ensemble, and the final outcome is the class of the SVM with the largest decision value. You can see the decision value as a function of the margin from the hyperplane: the larger the margin, the more confident the prediction.
- in the One-vs-One approach you train N*(N-1)/2 SVMs: one SVM for every pair of classes. As above, you feed the unknown pattern to the ensemble, and the final outcome is decided by a majority vote over all the SVMs' outputs. Most toolboxes implement this approach for multiclass classification.
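As a rough illustration of the two strategies, here is a sketch using scikit-learn's wrappers (an assumption on my part; the answer only mentions LibSVM, which scikit-learn's SVC wraps internally). With 3 classes, One-vs-All trains 3 SVMs and One-vs-One trains 3*(3-1)/2 = 3 SVMs:

```python
# Hypothetical sketch of the two multiclass strategies (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One-vs-All: one SVM per class (each sees its class as positive, the rest as negative).
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One-vs-One: one SVM per pair of classes, combined by majority vote.
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print(len(ovr.estimators_))  # number of SVMs trained: one per class
print(len(ovo.estimators_))  # number of SVMs trained: N*(N-1)/2
```

With more classes the One-vs-One count grows quadratically, which is why the training burden comes up below.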
So, as you can see, SVMs look more cumbersome in terms of computational time. However, keep in mind that the training only has to be done once: you can then save the trained model(s) and just perform predictions with them. KNN, by contrast, must evaluate all the distances and select the K neighbours for every new pattern.
I strongly discourage you from implementing your own SVMs. The training algorithms are rather complex unless you are an expert programmer, since (again) there is an optimization problem involved. I don't know which language you plan to code in, but Matlab has its own SVM library and OpenCV has one as well; in my opinion the best is LibSVM: it's free, cross-platform, cross-language, and fast. I always use it.
Another important aspect of SVMs is the kernel. In KNN you just need to define a distance metric; in SVMs there are two major cases:
- the classes are linearly separable: in this case you don't have to do anything, since the hyperplane is evaluated via the dot product between patterns.
- the classes are not linearly separable: in this case you need a kernel function that maps the patterns into a different (often higher-dimensional) space in which the classes become linearly separable. You must then select an appropriate kernel function: common kernels are the polynomial and the Gaussian Radial Basis Function (RBF).
So in KNN you only have to tune the parameter K and select an appropriate distance metric, whereas in SVMs you must select the C parameter (a regularization term) and possibly the kernel parameters if your classes are not linearly separable (for the polynomial kernel you specify the degree and the coefficients, whereas for the Gaussian RBF kernel you specify the shape parameter, i.e. the standard deviation).
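To make the tuning burden concrete, here is a sketch of both searches side by side; I'm assuming scikit-learn and arbitrary example parameter grids, which are not part of the original answer:

```python
# Hypothetical comparison of the hyperparameter searches (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# KNN: only K and the distance metric to choose.
knn = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5], "metric": ["euclidean", "manhattan"]},
    cv=5,
).fit(X, y)

# SVM with Gaussian RBF kernel: C (regularization) plus the kernel shape parameter gamma.
svm = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
).fit(X, y)

print(knn.best_params_)
print(svm.best_params_)
```

The grids here are illustrative; in practice the RBF parameters are usually searched on a log scale.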
My suggestion is to try KNN first (simply because it's easier, given that you have never used SVMs before), even though it's practically impossible to predict the error rate (%) as requested, because several things are involved:
- value for K
- distance metric
- how your letters are encoded (are they vectors? matrices? ...) and how big they are (font size matters too!)
But it would be a nice experiment (if you have time; I don't know about any deadlines for this project) to see how things differ when using SVMs. Have a look at LibSVM, initially try the linearly separable case (i.e. the kernel function is simply a dot product), and then experiment with more sophisticated kernels.
In conclusion, even though SVMs look tough, they provide an optimal solution. Both SVMs and KNN are widely used when it comes to OCR, and several experiments have shown SVMs to perform best (with a minor difference in accuracy, something like 2% higher than KNN). You can see this article; the above-mentioned OpenCV also provides examples of OCR classification for both the KNN case and the SVM case.
You should probably try to reduce the number of variables to a sensible set before trying to classify using nearest neighbors. Otherwise you'll fall victim to the curse of dimensionality, which is referenced in the Wikipedia article on $k$-nearest neighbors. You might also consider some sort of scaling of the variables so that no particular attribute has an undue influence on your classifications.
Your Python code could also be simplified quite a bit. Instead of defining these functions you could use the inner product function from numpy:
import math
import numpy as np

# example vectors (placeholders for your a and b)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# inner product
inner = np.dot(a, b)
# cosine similarity
cosine = np.dot(a, b) / math.sqrt(np.dot(a, a) * np.dot(b, b))
Best Answer
Training error here is the error you get when you feed your training set to your KNN as the test set. When K = 1, you choose the single closest training sample to each test sample. Since every test sample is itself in the training data set, it will choose itself as the closest neighbour and never make a mistake. For this reason, the training error is zero when K = 1, irrespective of the data set. There is one implicit assumption here, by the way: the training set must not contain identical samples belonging to different classes, i.e. conflicting information. Some real-world data sets do have this property, though.
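The argument above can be checked numerically with a minimal 1-NN implementation in plain numpy (my own sketch, assuming no duplicate points with conflicting labels):

```python
# Minimal check: a 1-NN classifier makes zero errors on its own training set,
# because each training point's nearest neighbour (at distance 0) is itself.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))      # 50 training points in 2-D (almost surely no duplicates)
y = rng.integers(0, 3, size=50)   # arbitrary labels from 3 classes

def predict_1nn(X_train, y_train, X_test):
    # For each test point, return the label of the closest training point.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(d, axis=1)]

train_error = np.mean(predict_1nn(X, y, X) != y)
print(train_error)  # 0.0
```

With K > 1 the neighbourhood includes other points, so the training error is generally no longer zero.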