A full answer to this question would be rather long, so I'll try to keep it short.
The major differences between KNN and SVMs are:
- SVMs need a proper training phase, whereas KNN has none: it classifies points directly, using a given distance metric
- in SVMs the optimum is guaranteed because training solves a convex optimization problem: SVMs aim at separating the classes with the optimal (maximum-margin) hyperplane. KNN, by contrast, involves no optimization step and is only quasi-optimal.
- unlike KNN (which works for any number of classes), standard SVMs are binary classifiers (the hyperplane separates two classes). To build a multiclass SVM classifier you have to use either the One-vs-One approach or the One-vs-All approach. Many toolboxes, such as LibSVM, already implement multiclass SVMs. Just for the record:
- in the One-vs-All approach you must train as many SVMs as there are classes (the ith SVM sees the ith class as positive and all the others as negative). You then feed the unknown pattern to the whole ensemble, and the final outcome is the class whose SVM returns the largest decision value. You can see the decision value as a function of the distance from the hyperplane: the larger the distance, the more confident the prediction.
- in the One-vs-One approach you must train N*(N-1)/2 SVMs (N being the number of classes): one SVM for every pair of classes. As above, you feed the unknown pattern to the ensemble, and the final outcome is decided by a majority vote amongst the outcomes of all the SVMs. Most toolboxes implement this approach for multiclass classification.
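To make the two multiclass schemes above concrete, here is a minimal sketch of just the decision step in plain Python. It assumes the binary SVMs have already been trained elsewhere; the class names, decision values and pairwise winners are made-up numbers for illustration, not output from any real library.

```python
from collections import Counter
from itertools import combinations


def one_vs_all_predict(decision_values):
    """Pick the class whose SVM gives the largest decision value.

    decision_values: dict mapping class -> decision value of that class's SVM.
    """
    return max(decision_values, key=decision_values.get)


def one_vs_one_predict(pairwise_winners):
    """Pick the class that wins the most pairwise duels (majority vote).

    pairwise_winners: dict mapping (class_a, class_b) -> winning class.
    """
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]


classes = ["A", "B", "C", "D"]

# One-vs-One needs one SVM per unordered pair of classes: N*(N-1)/2 of them.
pairs = list(combinations(classes, 2))
n = len(classes)
assert len(pairs) == n * (n - 1) // 2  # 6 pairwise SVMs for 4 classes

# Hypothetical outputs for a single unknown pattern (illustrative values only).
ova_scores = {"A": -0.7, "B": 1.3, "C": 0.2, "D": -1.1}
ovo_winners = {
    ("A", "B"): "B", ("A", "C"): "C", ("A", "D"): "A",
    ("B", "C"): "B", ("B", "D"): "B", ("C", "D"): "C",
}

ova_label = one_vs_all_predict(ova_scores)
ovo_label = one_vs_one_predict(ovo_winners)
```

In this toy case both schemes agree on class "B", but in general they can disagree, and how ties in the vote are broken is up to the toolbox.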
So, as you can see, SVMs look more cumbersome in terms of computational time. However, keep in mind that the training only needs to be done once: you can then save the trained model(s) and just perform predictions with them. KNN, by contrast, must evaluate the distances to all the training points and select the K nearest neighbours for every new pattern.
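The per-query cost of KNN is easy to see in code. The following is a minimal brute-force sketch, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # No training phase: every prediction measures the distance to ALL points.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D data: two well-separated clusters.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["a", "a", "a", "b", "b", "b"]

label_near_origin = knn_predict(train_points, train_labels, (0.5, 0.5), k=3)
label_far_corner = knn_predict(train_points, train_labels, (5.5, 5.5), k=3)
```

Swapping `math.dist` for another distance function is the only change needed to try a different metric.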
I strongly discourage you from implementing your own SVMs. The training algorithms are rather complex if you're not an expert programmer since (again) there is an optimization problem involved. I don't know which language you're planning to code in, but Matlab has its own SVM library and so does OpenCV. In my opinion, though, the best is LibSVM: it's free, cross-platform, cross-language and fast. I always use it.
Another important aspect of SVMs is the kernel. In KNN you just need to define a distance metric. In SVMs there are two major cases:
- classes are linearly separable: in this case you don't have to do anything, since the hyperplane is computed from the plain dot product between patterns (i.e. a linear kernel).
- classes are not linearly separable: in this case you need a kernel function that implicitly maps the patterns into a different (usually higher-dimensional) space in which the classes become linearly separable. You must then select an appropriate kernel function: common choices are the polynomial kernel and the Gaussian Radial Basis Function (RBF) kernel.
So in KNN you only have to tune the K parameter and select an appropriate distance metric, whereas in SVMs you must select a C parameter (the regularization term) and possibly the kernel parameters if your classes are not linearly separable: for the polynomial kernel you must specify the polynomial degree and its coefficients, whereas for the Gaussian RBF kernel you must specify the shape parameter (i.e. the standard deviation).
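For reference, the three kernels mentioned above can be sketched in a few lines of plain Python. The parameter names (`degree`, `gamma`, `coef0`, `sigma`) are just one common naming choice, assumed here for illustration; toolboxes may parameterize the same kernels differently (for instance, the RBF kernel is often written with gamma = 1 / (2 * sigma^2)).

```python
import math


def linear_kernel(x, y):
    """Plain dot product: the linearly separable case."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


x = (1.0, 2.0)
y = (3.0, 4.0)

k_linear = linear_kernel(x, y)               # 1*3 + 2*4 = 11
k_poly = polynomial_kernel(x, y, degree=2)   # (11 + 1) ** 2 = 144
k_rbf_same = rbf_kernel(x, x)                # identical points give 1.0
k_rbf = rbf_kernel(x, y, sigma=2.0)          # exp(-8 / 8) = exp(-1)
```

A small `sigma` makes the RBF kernel very local (similarity drops off quickly with distance), while a large one makes it behave almost linearly, which is why this parameter usually has to be tuned together with C.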
My suggestion is to try KNN first (simply because it's easier, since you have never used SVMs before), even though it's practically impossible to predict the error rate (%) as requested, because several things are involved:
- value for K
- distance metric
- how your letters are coded (are they vectors? matrices?...) and how big they are (font size also matters!)
But it would be a nice experiment (if you have time, I don't know about any deadlines for this project) to see how things are different when using SVMs. Maybe have a look at LibSVM and initially try using the linearly separable case (i.e. the kernel function is simply a dot product) and then try to experiment with more sophisticated kernels.
In conclusion, even though SVMs look tough, they provide an optimal solution to their training problem. Both SVMs and KNN are widely used when it comes to OCR, but several experiments have found SVMs to perform best (by a small margin in accuracy, something like 2% higher than KNN). You can see this article, and the above-mentioned OpenCV also provides examples of OCR classification for both the KNN case and the SVM case.
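A common way to normalize the inputs of a SOM is to scale each feature to zero mean and unit variance: the feature's mean is subtracted from each observation and the result is divided by the feature's standard deviation. Note that the resulting values are not confined to the range [0, 1]; that would require min-max rescaling instead.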
If you normalize the training set but not the validation set, then you are likely comparing observations on different scales. I'd suggest using the means and standard deviations of the training set to normalize the validation set.
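A minimal sketch of that suggestion, using only the Python standard library: the statistics are computed on the training set alone and then reused unchanged on the validation set. The helper names `fit_standardizer` and `apply_standardizer` and the toy data are hypothetical, and the population standard deviation is an assumption (a sample standard deviation would work just as well, as long as it is used consistently).

```python
import statistics


def fit_standardizer(rows):
    """Compute per-feature mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    stds = [statistics.pstdev(column) for column in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [
        # Constant features (std == 0) are mapped to 0.0 to avoid dividing by zero.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
valid = [[2.0, 40.0]]

# Fit on the training set only, then apply the SAME statistics to both sets.
means, stds = fit_standardizer(train)
train_scaled = apply_standardizer(train, means, stds)
valid_scaled = apply_standardizer(valid, means, stds)
```

Fitting a second set of means and standard deviations on the validation set would put the two sets on slightly different scales, which is exactly the mismatch described above.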
Best Answer
RESCALING attribute data to the range [0, 1] or [−1, 1] is useful for the optimization algorithms, such as gradient descent, that are used within machine learning algorithms that weight inputs (e.g. regression and neural networks). Rescaling is also used for algorithms that rely on distance measurements, for example K-Nearest-Neighbors (KNN). Rescaling like this is sometimes called "normalization". The MinMaxScaler class in Python's scikit-learn does this.
NORMALIZING attribute data means rescaling the components of each feature vector so that the complete vector has length 1. This is "scaling by unit length". It usually means dividing each component of the feature vector by the Euclidean length of the vector, but the Manhattan length or another norm can be used instead. This preprocessing method is useful for sparse attribute features and for algorithms that learn from distances, such as KNN. The Normalizer class in Python's scikit-learn can be used for this.
STANDARDIZING attribute data is also a preprocessing method, but it is most appropriate when the input features are roughly Gaussian. It "standardizes" each feature to a mean of 0 and a standard deviation of 1. This works better with linear regression, logistic regression and linear discriminant analysis. The StandardScaler class in Python's scikit-learn works for this.
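The difference between the three operations is easiest to see side by side. Below is a rough standard-library sketch, not the scikit-learn implementation; the function names are made up for illustration. The main thing to notice is the axis: rescaling and standardizing work on one feature (column) at a time, whereas unit-length normalizing works on one observation (row) at a time.

```python
import math
import statistics


def rescale_min_max(column, low=0.0, high=1.0):
    """Rescale one feature (column) linearly into [low, high]."""
    c_min, c_max = min(column), max(column)
    span = c_max - c_min
    if span == 0:
        return [low for _ in column]  # constant feature: nothing to spread out
    return [low + (v - c_min) * (high - low) / span for v in column]


def normalize_unit_length(row, norm="l2"):
    """Scale one observation (row) so that its vector length is 1."""
    if norm == "l2":
        length = math.sqrt(sum(v * v for v in row))  # Euclidean length
    elif norm == "l1":
        length = sum(abs(v) for v in row)            # Manhattan length
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return [v / length for v in row] if length > 0 else list(row)


def standardize(column):
    """Shift one feature (column) to mean 0 and scale it to standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]


rescaled = rescale_min_max([2.0, 4.0, 6.0])               # values in [0, 1]
rescaled_sym = rescale_min_max([2.0, 4.0, 6.0], -1.0, 1.0)  # values in [-1, 1]
unit_l2 = normalize_unit_length([3.0, 4.0])               # Euclidean length 1
unit_l1 = normalize_unit_length([3.0, 4.0], norm="l1")    # absolute values sum to 1
standardized = standardize([1.0, 2.0, 3.0])               # mean 0, std 1
```

As far as I know, these correspond to the default behaviour of MinMaxScaler (per feature, range [0, 1]), Normalizer (per sample, Euclidean norm) and StandardScaler (per feature, population standard deviation), but check the scikit-learn documentation before relying on that.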