Solved – Normalized vs. Non-Normalized Data in SVM Training

Tags: classification, libsvm, svm

I'm training an SVM on a dataset containing 17000 events with 30 features. As I see it, this ratio of features to dataset size should by itself suffice to avoid overfitting.

I'm training with an RBF kernel and I get somewhat strange results. I'm fairly sure my training flow is correct, so that is not the main issue. The main issue is the difference between normalized and non-normalized data and its effect on the training error.

Here is what I do:

I'm training a classifier with an 'rbf' kernel, C is 16 and gamma is 8. I check the training error by plotting the ROC curve on the training set. I do this once for the original data and once for the normalized data (I normalize each column by subtracting its mean and dividing by its (max − min) range).
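The normalization described above can be sketched as follows (a minimal NumPy version, assuming the data is held in a 2-D array with one column per feature; the `normalize` helper and the toy array are illustrative, not from the original code):

```python
import numpy as np

def normalize(X):
    """Per-column normalization as described in the question:
    subtract each column's mean, then divide by its (max - min) range."""
    mean = X.mean(axis=0)
    rng = X.max(axis=0) - X.min(axis=0)
    rng[rng == 0] = 1.0  # guard against constant columns
    return (X - mean) / rng

# Toy example: 3 events, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Xn = normalize(X)
```

After this transform, every column has zero mean and a range of exactly 1, so features on very different scales (which dominate an RBF kernel's distance computation) become comparable.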

I get the following graphs:

For the normalized data:

[ROC plot: normalized data]

For the original data:

[ROC plot: non-normalized data]

These are quite different. Any ideas why that is?

Best Answer

I'm training a classifier with an 'rbf' kernel, C is 16 and gamma is 8.

This is the problem. You cannot use the same hyperparameters before and after normalization. You will need to tune both $C$ and $\gamma$ again after normalization. If the same hyperparameters would yield similar performance before and after normalization, the normalization basically had no effect.
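Re-tuning after normalization is typically done with a grid search over $C$ and $\gamma$ with cross-validation. A hedged sketch using scikit-learn's `GridSearchCV` (the question uses libsvm, whose bundled `grid.py` script does the same job; the synthetic data and grid ranges below are illustrative assumptions, not the asker's setup):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data: the real problem has 17000 events x 30 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels

# Search C and gamma on a log2 grid, as is common practice for RBF SVMs.
param_grid = {"C": [2.0**k for k in range(-3, 7)],
              "gamma": [2.0**k for k in range(-7, 4)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

The point is that this search must be run separately on the raw and on the normalized data: the $(C, \gamma)$ pair that is optimal for one scaling is in general far from optimal for the other.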

I'm training an SVM on a dataset containing 17000 events with 30 features. As I see it, this ratio of features to dataset size should by itself suffice to avoid overfitting.

Overfitting is avoided through regularization and hence depends entirely on the value of the associated hyperparameter $C$ and the complexity of the kernel (VC dimension), not on the number of training instances. A high number of training instances does not protect you from overfitting when using SVM.

To see this, recall that an SVM with nonlinear kernel is actually a linear combination of kernel evaluations between the test point and the support vectors (training instances). In SVM, the number of training instances is actually the number of degrees of freedom.

Given a sufficiently complex kernel and high misclassification penalty $C$, you can construct an SVM model with perfect training classification for any number of training instances. As an example, consider the RBF kernel:

$$\kappa(\mathbf{x},\mathbf{y}) = \exp(-\gamma \|\mathbf{x}-\mathbf{y}\|^2)$$

For increasing $\gamma$, the resulting kernel matrix converges to the identity matrix. Combine this with a large value of $C$ and you end up with a model that has perfect training classification (for any number of training instances) but is entirely worthless on unseen data.
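The convergence of the RBF Gram matrix to the identity is easy to demonstrate numerically (a small sketch; the points and $\gamma$ values are arbitrary):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 random points in 3 dimensions

K_small = rbf_kernel_matrix(X, gamma=0.01)    # all entries close to 1
K_large = rbf_kernel_matrix(X, gamma=1000.0)  # approaches the identity

print(np.round(K_large, 3))
```

With a near-identity Gram matrix every training point is "similar" only to itself, so the model can memorize the training labels perfectly while generalizing not at all.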

I'm checking the training error by plotting the ROC on the training set.

I assume you are aware that the plots you have included are not valid ROC curves.
