Solved – Why a large gamma in the RBF kernel of SVM leads to a wiggly decision boundary and causes over-fitting

classification, hyperparameter, machine learning, rbf kernel, svm

The hyperparameter $\gamma$ of the Gaussian/RBF kernel controls the tradeoff between error due to bias and variance in your model. If you have a very large value of gamma, then even if your two inputs are quite “similar”, the value of the kernel function will be small – meaning that the support vector $x_n$ does not have much influence on the classification of the testing example $x_m$. This allows the SVM to capture more of the complexity and shape of the data, but if the value of gamma is too large, then the model can overfit and be prone to low bias/high variance.

which is from here (the second answer). I do understand the first part, i.e. if gamma is large, the influence of a support vector won't reach far. However, I just can't figure out why a large gamma can lead to a wiggly decision boundary, capture more of the complexity and shape of the training data, and thus cause over-fitting. Any hint will be helpful!
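To make the question concrete, here's a small experiment of my own (only a sketch: the two-moons dataset, noise level, and gamma grid are arbitrary choices) showing the behaviour I'm asking about. With a very large gamma the training accuracy is typically near perfect while the test accuracy drops and a large fraction of the training points end up as support vectors:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class data with a moderately curved true boundary.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.1, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    print(f"gamma={gamma:6.1f}  "
          f"train acc={clf.score(X_train, y_train):.2f}  "
          f"test acc={clf.score(X_test, y_test):.2f}  "
          f"#support vectors={len(clf.support_)}")
```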

Best Answer

Using a kernelized SVM is equivalent to mapping the data into feature space, then using a linear SVM in feature space. The feature space mapping is defined implicitly by the kernel function, which computes the inner product between data points in feature space. That is:

$$\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$$

where $\kappa$ is the kernel function, $x_i$ and $x_j$ are data points, and $\Phi$ is the feature space mapping. The RBF kernel maps points nonlinearly into an infinite-dimensional feature space.
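For concreteness, here's a small numeric sketch (not part of the original argument; the two points and the $\gamma$ grid are arbitrary choices). Since $\kappa(x, x) = 1$ for the RBF kernel, every point maps to a unit vector in feature space, and the squared feature-space distance between two points is $\|\Phi(x_i) - \Phi(x_j)\|^2 = 2\,(1 - \kappa(x_i, x_j))$:

```python
import numpy as np

# RBF kernel: kappa(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
def rbf(xi, xj, gamma):
    return np.exp(-gamma * np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

xi, xj = [0.0, 0.0], [0.3, 0.4]          # two points at Euclidean distance 0.5
for gamma in [0.1, 1.0, 10.0, 100.0]:
    k = rbf(xi, xj, gamma)
    # Feature-space distance follows from ||Phi(x_i) - Phi(x_j)||^2 = 2 * (1 - kappa).
    print(f"gamma={gamma:6.1f}  kappa={k:.6f}  "
          f"feature-space distance={np.sqrt(2 * (1 - k)):.3f}")
```

With $\gamma = 100$, two inputs only $0.5$ apart already have $\kappa \approx 0$: their images are essentially orthogonal unit vectors at distance $\sqrt{2}$, just like any other pair of distinct points, so each training example only influences the decision function in a tiny neighbourhood of itself.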

Larger RBF kernel bandwidths $\sigma$ (i.e. smaller $\gamma = \tfrac{1}{2\sigma^2}$) produce smoother decision boundaries because they produce smoother feature space mappings. Forgetting about RBF kernels for the moment, here's a cartoon showing why smoother mappings produce simpler decision boundaries:

[Figure: 1-D data points mapped nonlinearly into a 2-D feature space, shown for a smoother and a less smooth mapping, with a linear decision boundary drawn in feature space]

In this example, one-dimensional data points are mapped nonlinearly into a higher-dimensional (2-D) feature space, and a linear classifier is fit in feature space. The decision boundary in feature space is a hyperplane (here, a straight line), but it is nonlinear when viewed in the original input space. When the feature space mapping is less smooth, the data can 'poke through' that hyperplane in more complicated ways, yielding more intricate decision boundaries in input space.
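To put numbers on the cartoon, here's a minimal sketch (a toy construction of my own, not the mapping from the figure; the explicit map $\Phi(x) = (x, \sin(\omega x))$ and the fixed separating line are arbitrary illustrative choices):

```python
import numpy as np

# Map 1-D inputs into a 2-D feature space with Phi(x) = (x, sin(omega * x)),
# and keep the SAME straight-line decision boundary in feature space,
# f(u, v) = u + v - 0.5.  Counting how often f(Phi(x)) changes sign along the
# original axis shows how many pieces the induced boundary has in input space.
xs = np.linspace(0.0, 1.0, 2001)

def input_space_crossings(omega):
    f = xs + np.sin(omega * xs) - 0.5      # f(Phi(x)) with w = (1, 1), b = -0.5
    return int(np.sum(np.diff(np.sign(f)) != 0))

for omega in [2.0, 20.0, 200.0]:
    print(f"omega={omega:6.1f}  boundary crossings in input space: {input_space_crossings(omega)}")
```

The smooth map ($\omega = 2$) crosses the line once, while the wigglier maps cross it many times, so the same flat boundary in feature space breaks into many small pieces back in input space. A large $\gamma$ (small bandwidth) plays the role of a large $\omega$ here: the implicit RBF feature map varies rapidly, the 'poking through' happens on a very fine scale, and the result is the wiggly, over-fit boundary.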