Solved – Why a large gamma in the RBF kernel of SVM leads to a wiggly decision boundary and causes over-fitting

classification, hyperparameter, machine learning, rbf kernel, svm

The hyperparameter $\gamma$ of the Gaussian/RBF kernel controls the tradeoff between error due to bias and variance in your model. If you have a very large value of gamma, then even if your two inputs are quite “similar”, the value of the kernel function will be small – meaning that the support vector $x_n$ does not have much influence on the classification of the testing example $x_m$. This allows the SVM to capture more of the complexity and shape of the data, but if the value of gamma is too large, then the model can overfit and be prone to low bias/high variance.

which is from here (the second answer). I do understand the first part, i.e. if gamma is large, the influence of a support vector won't reach far. However, I just can't figure out why a large gamma can lead to a wiggly decision boundary, capture more of the complexity and shape of the training data, and thus cause over-fitting. Any hint will be helpful!
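To make the question concrete, here's a small experiment of my own (only a sketch: the two-moons dataset, noise level, and gamma grid are arbitrary choices) showing the behaviour I'm asking about. With a very large gamma the training accuracy is typically near perfect while the test accuracy drops and a large fraction of the training points end up as support vectors:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class data with a moderately curved true boundary.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.1, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    print(f"gamma={gamma:6.1f}  "
          f"train acc={clf.score(X_train, y_train):.2f}  "
          f"test acc={clf.score(X_test, y_test):.2f}  "
          f"#support vectors={len(clf.support_)}")
```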

Best Answer

Using a kernelized SVM is equivalent to mapping the data into feature space, then using a linear SVM in feature space. The feature space mapping is defined implicitly by the kernel function, which computes the inner product between data points in feature space. That is:

$$\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$$

where $\kappa$ is the kernel function, $x_i$ and $x_j$ are data points, and $\Phi$ is the feature space mapping. The RBF kernel maps points nonlinearly into an infinite-dimensional feature space.
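For concreteness, here's a small numeric sketch (not part of the original argument; the two points and the $\gamma$ grid are arbitrary choices). Since $\kappa(x, x) = 1$ for the RBF kernel, every point maps to a unit vector in feature space, and the squared feature-space distance between two points is $\|\Phi(x_i) - \Phi(x_j)\|^2 = 2\,(1 - \kappa(x_i, x_j))$:

```python
import numpy as np

# RBF kernel: kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
def rbf(xi, xj, gamma):
    return np.exp(-gamma * np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

xi, xj = [0.0, 0.0], [0.3, 0.4]          # two points at Euclidean distance 0.5
for gamma in [0.1, 1.0, 10.0, 100.0]:
    k = rbf(xi, xj, gamma)
    # Feature-space distance follows from ||Phi(x_i) - Phi(x_j)||^2 = 2 * (1 - kappa).
    print(f"gamma={gamma:6.1f}  kappa={k:.6f}  "
          f"feature-space distance={np.sqrt(2 * (1 - k)):.3f}")
```

With $\gamma = 100$, two inputs only $0.5$ apart already have $\kappa \approx 0$: their images are essentially orthogonal unit vectors at distance $\sqrt{2}$, just like any other pair of distinct points, so each training example only influences the decision function in a tiny neighbourhood of itself.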

Larger RBF kernel bandwidths $\sigma$ (i.e. smaller $\gamma = \tfrac{1}{2\sigma^2}$) produce smoother decision boundaries because they produce smoother feature space mappings. Forgetting about RBF kernels for the moment, here's a cartoon showing why smoother mappings produce simpler decision boundaries:

[Figure: 1-D data points mapped nonlinearly into a 2-D feature space, shown for a smoother and a less smooth mapping, with a linear decision boundary drawn in feature space]

In this example, one-dimensional data points are mapped nonlinearly into a higher-dimensional (2-D) feature space, and a linear classifier is fit in feature space. The decision boundary in feature space is a hyperplane (here, a straight line), but it is nonlinear when viewed in the original input space. When the feature space mapping is less smooth, the data can 'poke through' that hyperplane in more complicated ways, yielding more intricate decision boundaries in input space.
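To put numbers on the cartoon, here's a minimal sketch (a toy construction of my own, not the mapping from the figure; the explicit map $\Phi(x) = (x, \sin(\omega x))$ and the fixed separating line are arbitrary illustrative choices):

```python
import numpy as np

# Map 1-D inputs into a 2-D feature space with Phi(x) = (x, sin(omega * x)),
# and keep the SAME straight-line decision boundary in feature space,
# f(u, v) = u + v - 0.5.  Counting how often f(Phi(x)) changes sign along the
# original axis shows how many pieces the induced boundary has in input space.
xs = np.linspace(0.0, 1.0, 2001)

def input_space_crossings(omega):
    f = xs + np.sin(omega * xs) - 0.5      # f(Phi(x)) with w = (1, 1), b = -0.5
    return int(np.sum(np.diff(np.sign(f)) != 0))

for omega in [2.0, 20.0, 200.0]:
    print(f"omega={omega:6.1f}  boundary crossings in input space: {input_space_crossings(omega)}")
```

The smooth map ($\omega = 2$) crosses the line once, while the wigglier maps cross it many times, so the same flat boundary in feature space breaks into many small pieces back in input space. A large $\gamma$ (small bandwidth) plays the role of a large $\omega$ here: the implicit RBF feature map varies rapidly, the 'poking through' happens on a very fine scale, and the result is the wiggly, over-fit boundary.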