My dataset is small (120 samples), however the number of features are large varies from (1000-200,000). Although I'm doing feature selection to pick a subset of features, it might still overfit.
My first question is, how does SVM handle overfitting, if at all.
Secondly, as I study more about overfitting in case of classification, I came to the conclusion that even datasets with small number of features can overfit. If we do not have features correlated to the class label, overfitting takes place anyways. So I'm now wondering what's the point of automatic classification if we cannot find the right features for a class label. In case of document classification, this would mean manually crafting a thesaurus of words that relate to the labels, which is very time consuming. I guess what I'm trying to say is, without hand-picking the right features it is very difficult to build a generalized model ?
Also, if the experimental results don't show that the results have low/no overfitting it becomes meaningless. Is there a way to measure it ?
Best Answer
In practice, the reason that SVMs tend to be resistant to over-fitting, even in cases where the number of attributes is greater than the number of observations, is that it uses regularization. They key to avoiding over-fitting lies in careful tuning of the regularization parameter, $C$, and in the case of non-linear SVMs, careful choice of kernel and tuning of the kernel parameters.
The SVM is an approximate implementation of a bound on the generalization error, that depends on the margin (essentially the distance from the decision boundary to the nearest pattern from each class), but is independent of the dimensionality of the feature space (which is why using the kernel trick to map the data into a very high dimensional space isn't such a bad idea as it might seem). So in principle SVMs should be highly resistant to over-fitting, but in practice this depends on the careful choice of $C$ and the kernel parameters. Sadly, over-fitting can also occur quite easily when tuning the hyper-parameters as well, which is my main research area, see
G. C. Cawley and N. L. C. Talbot, Preventing over-fitting in model selection via Bayesian regularisation of the hyper-parameters, Journal of Machine Learning Research, volume 8, pages 841-861, April 2007. (www)
and
G. C. Cawley and N. L. C. Talbot, Over-fitting in model selection and subsequent selection bias in performance evaluation, Journal of Machine Learning Research, 2010. Research, vol. 11, pp. 2079-2107, July 2010. (www)
Both of those papers use kernel ridge regression, rather than the SVM, but the same problem arises just as easily with SVMs (also similar bounds apply to KRR, so there isn't that much to choose between them in practice). So in a way, SVMs don't really solve the problem of over-fitting, they just shift the problem from model fitting to model selection.
It is often a temptation to make life a bit easier for the SVM by performing some sort of feature selection first. This generally makes matters worse, as unlike the SVM, feature selection algorithms tend to exhibit more over-fitting as the number of attributes increases. Unless you want to know which are the informative attributes, it is usually better to skip the feature selection step and just use regularization to avoid over-fitting the data.
In short, there is no inherent problem with using an SVM (or other regularised model such as ridge regression, LARS, Lasso, elastic net etc.) on a problem with 120 observations and thousands of attributes, provided the regularisation parameters are tuned properly.