Is there an advantage to using higher dimensions (2D, 3D, etc) or should you just build x-1 single dimension classifiers and aggregate their predictions in some way?
This depends on whether your features are informative or not. Do you suspect that some features will not be useful in your classification task? To gain a better idea of your data, you can also try to compute pairwise correlation or mutual information between the response variable and each of your features.
To combine all (or a subset) of your features, a good starting point is to compute the L1 (Manhattan) or L2 (Euclidean) distance between the query point and each 'training' point.
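As a sketch in plain Python (the helper names here are mine, not from any library), a nearest-neighbour vote using either distance might look like:

```python
import math
from collections import Counter

def manhattan(a, b):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # L2 distance: square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, points, labels, k=3, dist=euclidean):
    # rank training points by distance to the query, then majority-vote the labels
    nearest = sorted(zip(points, labels), key=lambda pl: dist(query, pl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Swapping `dist=manhattan` for `dist=euclidean` is the only change needed to compare the two metrics on your data.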
Building all of these classifiers from all potential combinations of the variables would be computationally expensive. How could I optimize this search to find the best kNN classifiers from that set?
This is the problem of feature subset selection. There is a lot of academic work in this area; for a good overview, see Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182.
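One common strategy from that literature is greedy forward selection: start from the empty set and repeatedly add the feature that most improves a validation score, rather than enumerating all subsets. A minimal sketch in plain Python, scored by leave-one-out kNN accuracy (the function names are mine):

```python
def loo_accuracy(feature_idx, X, y, k=1):
    # leave-one-out: classify each point from all the others,
    # using only the chosen subset of features
    correct = 0
    for i in range(len(X)):
        dists = sorted(
            (sum((X[i][f] - X[j][f]) ** 2 for f in feature_idx), y[j])
            for j in range(len(X)) if j != i
        )
        votes = [label for _, label in dists[:k]]
        correct += max(set(votes), key=votes.count) == y[i]
    return correct / len(X)

def forward_select(X, y, k=1):
    # greedily add the feature that most improves leave-one-out accuracy,
    # stopping as soon as no remaining feature improves the score
    remaining = set(range(len(X[0])))
    chosen, best = [], -1.0
    while remaining:
        score, f = max((loo_accuracy(chosen + [f], X, y, k), f) for f in remaining)
        if score <= best:
            break
        chosen.append(f)
        remaining.remove(f)
        best = score
    return chosen, best
```

This evaluates O(d^2) subsets instead of 2^d, at the cost of possibly missing feature combinations that only help jointly.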
And, once I find a series of classifiers what's the best way to combine their output to a single prediction?
This will depend on whether or not the selected features are independent or not. In the case that features are independent, you can weight each feature by its mutual information (or some other measure of informativeness) with the response variable (whatever you are classifying on). If some features are dependent, then a single classification model will probably work best.
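For the independent case, one simple way to fold such weights into kNN is a weighted distance, where each feature's squared difference is scaled by its informativeness (a sketch; in practice the weights would come from e.g. mutual information estimates):

```python
def weighted_euclidean(a, b, weights):
    # features judged more informative (e.g. higher mutual information with
    # the response variable) get larger weights, hence more influence
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5
```

Setting a weight to zero drops that feature from the distance entirely, so this generalizes plain feature selection.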
How do most implementations apply kNN to a more generalized learning?
By allowing the user to supply their own distance metric (or a precomputed distance matrix) over the set of points. kNN works well when an appropriate distance metric is used.
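Concretely, a kNN routine that accepts the metric as a plain callable lets the caller plug in any domain-specific distance, such as cosine distance for directional data (a sketch with hypothetical names):

```python
import math
from collections import Counter

def cosine_distance(a, b):
    # 1 - cosine similarity: small when the vectors point the same way,
    # regardless of their magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn_predict(query, points, labels, k, metric):
    # the metric is an ordinary function, so any user-defined distance works
    nearest = sorted(zip(points, labels), key=lambda pl: metric(query, pl[0]))[:k]
    return Counter(lab for _, lab in nearest).most_common(1)[0][0]
```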
The full answer to this question is rather long, so I'll try to keep it short.
The major differences between KNN and SVMs are:
- SVMs need a proper training phase, whereas KNN classifies points directly using a given distance metric
- in SVMs the optimum is guaranteed, because the training phase solves a well-defined optimization problem: SVMs aim at separating the classes with the optimal hyperplane. KNN, by contrast, is only quasi-optimal.
- unlike KNN (which works for any number of classes), standard SVMs are binary classifiers (the hyperplane separates two classes). To build a multiclass SVM you have to use the One-vs-One approach or the One-vs-All approach. Many toolboxes, such as LibSVM, also implement multiclass SVMs. Just for the record:
- in the One-vs-All approach you must train as many SVMs as there are classes (the ith SVM treats the ith class as positive and all the others as negative); you then feed the unknown pattern to the whole ensemble, and the final outcome is the class whose SVM has the largest decision value. You can see the decision value as a function of the margin from the hyperplane: the larger the margin, the more confident the prediction.
- in the One-vs-One approach you must train N*(N-1)/2 SVMs: one SVM for every pair of classes. As above, you feed the unknown pattern to the ensemble, and the final outcome is decided by a majority vote over the outputs of all the SVMs. Most toolboxes implement this approach for multiclass classification.
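To make the counts and the One-vs-One vote concrete, here is a small sketch (`pairwise_winner` is a stand-in for whatever binary classifier you trained for each pair of classes):

```python
from collections import Counter
from itertools import combinations

def one_vs_all_count(n_classes):
    # one binary SVM per class
    return n_classes

def one_vs_one_count(n_classes):
    # one binary SVM per unordered pair of classes: N*(N-1)/2
    return n_classes * (n_classes - 1) // 2

def ovo_predict(classes, pairwise_winner):
    # pairwise_winner(a, b) returns whichever of the two classes the (a, b)
    # binary classifier chooses for the pattern; the majority vote decides
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]
```

For 10 classes (e.g. digit recognition) that is 10 One-vs-All machines versus 45 One-vs-One machines, which is why the trade-off matters in practice.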
So, as you can see, SVMs look more cumbersome in terms of computational time. However, keep in mind that the training only has to be done once: you can save the trained model(s) and just perform predictions with them afterwards. KNN, by contrast, must evaluate all the distances and select the K neighbours for every new pattern.
I strongly discourage you from implementing your own SVMs. The training algorithms are rather complex, since (again) there is an optimization problem involved. I don't know which language you're planning to code in, but Matlab has its own SVM library and so does OpenCV; in my opinion, though, the best is LibSVM: it's free, cross-platform, cross-language and fast. I always use it.
Another important aspect of the SVMs is the kernel. It is clear that in KNN you just need to define a distance metric. In SVMs you might have two major cases:
- the classes are linearly separable: in this case you don't have to do anything, since the hyperplane is found via the dot product between patterns.
- the classes are not linearly separable: in this case you must use a kernel function that maps the patterns into a different (often higher-dimensional) space in which the classes become linearly separable. You must then select an appropriate kernel function: common kernels are polynomials and the Gaussian Radial Basis Function.
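For reference, the Gaussian RBF kernel mentioned above fits in a few lines (sigma is the shape parameter discussed below):

```python
import math

def rbf_kernel(a, b, sigma):
    # K(a, b) = exp(-||a - b||^2 / (2 * sigma^2)): equals 1 when a == b
    # and decays towards 0 as the two points move apart
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

A small sigma makes the kernel very local (risking overfitting), while a large sigma makes it nearly flat (risking underfitting), which is why this parameter has to be tuned.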
So in KNN you only have to tune the K parameter and select an appropriate distance metric, whereas in SVMs you must select a C parameter (a regularization term) and possibly the kernel parameters, in case your classes are not linearly separable (for the polynomial kernel you must specify the degree and its coefficients, whereas for the Gaussian RBF kernel you must specify the shape parameter, i.e. the standard deviation).
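Since K is the main knob in KNN, a minimal search over candidate values using leave-one-out accuracy might look like this (plain Python, hypothetical names):

```python
def loo_knn_accuracy(X, y, k):
    # leave-one-out accuracy of a kNN classifier with squared Euclidean distance
    correct = 0
    for i in range(len(X)):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), y[j])
            for j in range(len(X)) if j != i
        )
        votes = [label for _, label in dists[:k]]
        correct += max(set(votes), key=votes.count) == y[i]
    return correct / len(X)

def best_k(X, y, candidates=(1, 3, 5)):
    # pick the K with the highest leave-one-out accuracy
    return max(candidates, key=lambda k: loo_knn_accuracy(X, y, k))
```

The same grid-search idea extends to the SVM side (looping over C and the kernel parameters), just with a costlier training step inside the loop.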
My suggestion is to try KNN first (simply because it's easier, given that you have never used SVMs before), even though it's practically impossible to predict the error rate (%) you asked for, since several things are involved:
- value for K
- distance metric
- how your letters are coded (are they vectors? matrices?...) and how big they are (font size also matters!)
But it would be a nice experiment (if you have time, I don't know about any deadlines for this project) to see how things are different when using SVMs. Maybe have a look at LibSVM and initially try using the linearly separable case (i.e. the kernel function is simply a dot product) and then try to experiment with more sophisticated kernels.
In conclusion, even though SVMs look tough, they provide an optimal solution. Both SVMs and KNN are widely used when it comes to OCR, and several experiments have found SVMs to perform best (with a minor difference in accuracy, something like 2% higher than KNN). You can see this article; the above-mentioned OpenCV also provides examples of OCR classification for both the KNN case and the SVM case.
There is no such thing as the best classifier; it always depends on the context and on what kind of data/problem is at hand. As you mention, kNN is slow when you have a lot of observations, since it does not generalize over the data in advance but scans the historical database each time a prediction is needed.
With kNN you need to think carefully about the distance measure. For instance, if one feature is measured in 1000s of kilometers, another feature in 0.001 grams, the first feature will dominate the distance measure. You can normalize the features, or give certain importance weights, based on the domain knowledge.
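A minimal sketch of such normalization (z-scores, plain Python, assuming no feature is constant): each feature is rescaled to zero mean and unit variance so that no single feature dominates the distance.

```python
def zscore_columns(rows):
    # compute per-feature mean and standard deviation, then rescale, so that
    # a feature in 1000s of km and one in 0.001s of grams weigh in equally;
    # assumes every feature varies (a constant column would divide by zero)
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]
```

Importance weights from domain knowledge can then be applied on top of the standardized features.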
Also, in a very high dimensional space the distance to all neighbors becomes more or less the same, and the notion of nearest and far neighbors becomes blurred.