SVM – Why Scaling Features Can Decrease SVM Performance

svm

I have used scaling on features of a model which contains 40 features (all columns are numbers) and a binary output variable.

This is the Kaggle contest here. I scaled the features assuming it would deliver better performance, but with an RBF-kernel SVM the accuracy under 10-fold CV fell from 0.92 to 0.87.
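For reference, the scaling I applied was along these lines (a sketch using scikit-learn's `StandardScaler`; the random matrix is just a stand-in for my actual feature columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the real data: 40 numeric feature columns
X = np.random.RandomState(0).uniform(0, 100, size=(200, 40))

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Sanity check: per-column means ~0 and standard deviations ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```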

Here is a box plot of features before and after scaling:


What I would like to know is: why does scaling decrease classifier performance? I have not seen any discussion pointing at this type of outcome.

Best Answer

The problem is that you used the default parameter values in both cases. Apparently, the default values happened to be better for your data set before scaling (this is a coincidence).

When using an SVM, the parameters $C$ and $\gamma$ play a crucial role, and it is your task to find the best values. Your intuition is correct: with properly scaled features, the optimal performance is better (in the vast majority of cases). Unfortunately, neither of your settings used optimal parameters, which led to a result that seemed to contradict your intuition.

Searching for the optimal values of $C$ and $\gamma$ is typically done via a grid search (i.e. evaluating a set of $(C, \gamma)$ combinations). You can estimate the performance of an SVM for a given pair of parameters using cross-validation.

In pseudo-code, the general idea is this:

for C in {set of candidate C values}
    for gamma in {set of candidate gamma values}
        estimate accuracy via k-fold cross-validation
        if accuracy is the best seen so far, store (C, gamma)
    end
end
train the SVM on the full training set with the best (C, gamma) pair
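In scikit-learn, this grid search can be sketched as follows (a minimal example on synthetic data standing in for your 40-feature binary set; the grid values are illustrative, not tuned recommendations). Note that the scaler is placed inside the pipeline, so each CV fold is scaled using only its own training split:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 40 numeric features, binary target
X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Scaling inside the pipeline avoids leaking validation-fold
# statistics into the training folds during cross-validation
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Coarse logarithmic grid over C and gamma
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1e-3, 1e-2, 1e-1, 1],
}

# 10-fold CV accuracy for every (C, gamma) pair on the grid;
# the best pipeline is refit on the full training set afterwards
search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)  # best (C, gamma) pair found
print(search.best_score_)   # mean CV accuracy at that pair
```

After `fit`, `search.best_estimator_` is the pipeline retrained on all of the data with the winning parameters, ready for prediction on the test set.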

You can find a good beginner's tutorial here.
