Solved – Is a high number of support vectors a sign of overfitting

machine-learning, svm

I've tuned an SVM with a radial kernel that has a training error of about 10%,
but a test error of about 38%, which surprises me.

I tried to understand what might cause this and noticed that the number of support vectors is high: about 50% of the observations are support vectors. Is this the cause of the problem? Does a high number of support vectors lead to a high-variance SVM?

Here is the summary of the SVM:

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  31.62
      gamma:  1e-04

Number of Support Vectors:  117
 ( 58 59 )

Number of Classes:  2
Levels:
 high low

Best Answer

YES, a large number of support vectors is often a sign of overfitting.

The problem appears to be that you have chosen optimal hyperparameters based on training set performance, rather than independent test set performance (or, alternatively, cross-validated estimates).

The problem

Never evaluate a set of hyperparameters based on training set performance, because you will inevitably end up with an immense overfit. Consider the RBF kernel:
$$K(\mathbf{u},\mathbf{v}) = \exp(-\gamma \|\mathbf{u}-\mathbf{v}\|^2).$$
Now take the following limiting case:
$$ \lim_{\gamma\rightarrow\infty} K(\mathbf{u},\mathbf{v}) = \begin{cases} 0, & \text{if } \mathbf{u}\neq\mathbf{v},\\ 1, & \text{otherwise.} \end{cases} $$
This essentially means that, as $\gamma\rightarrow\infty$, the kernel matrix approaches the identity matrix. The resulting model can then always fit the training data perfectly, with every instance becoming a support vector, but it learns absolutely nothing that generalizes.
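To see this concretely, here is a minimal sketch using `e1071` in R (toy data generated on the spot, with an absurdly large `gamma`; the variable names are mine): with a huge `gamma` the kernel matrix is numerically the identity, every observation ends up as a support vector, and the training fit is essentially perfect even though nothing useful is learned.

```
library(e1071)

set.seed(1)
n <- 100
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "high", "low"))

## absurdly large gamma: K(u, v) is numerically 0 for u != v and 1 on the diagonal
fit <- svm(x, y, kernel = "radial", gamma = 1e6, cost = 1)

fit$tot.nSV                 # every (or nearly every) observation is a support vector
mean(predict(fit, x) == y)  # training accuracy is essentially perfect
```

A new point, on the other hand, is numerically orthogonal in feature space to every support vector, so the prediction collapses to an essentially constant decision value.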

The solution

If you want to optimize hyperparameters, you must always estimate the generalization performance of a given set of hyperparameters based on unseen data. The most common approach is via cross-validation, e.g. $k$-fold or leave-one-out.
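For example, with the `e1071` package that produced the summary above, you can tune the hyperparameters by cross-validation with `tune()`. This is only a sketch: `train` and the response `y` are hypothetical placeholders, and the grids for `cost` and `gamma` are arbitrary illustrative choices.

```
library(e1071)

set.seed(1)
tuned <- tune(
  svm, y ~ ., data = train,                       # `train` is a hypothetical data frame
  kernel = "radial",
  ranges = list(cost  = 10^seq(-1, 3, by = 0.5),  # arbitrary illustrative grids
                gamma = 10^seq(-5, 0, by = 1)),
  tunecontrol = tune.control(sampling = "cross", cross = 10)
)

summary(tuned)            # cross-validated error for every (cost, gamma) pair
best <- tuned$best.model  # model refit on the full training data with the best pair
```

Keep in mind that `tuned$best.performance` is the cross-validated error of the *selected* combination and is therefore optimistically biased (see the caveat below), so the final model should still be evaluated once on a separate test set.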

Note that, even when you tune hyperparameters based on cross-validation, it is still possible to overfit to the cross-validation estimate itself. You can reduce this risk to some extent by using several repetitions and/or more folds, but the inherent issue remains.
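With the same tooling, repeated cross-validation is one way to average out some of the variance due to a particular fold assignment; this sketch assumes the installed version of `e1071` supports the `nrepeat` argument of `tune.control()`.

```
## repeat the 10-fold cross-validation 5 times with different fold assignments
## (assumes `nrepeat` is available in the installed e1071 version)
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
```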