SVM RBF Kernel – Heuristic Method for Estimating Gamma

cross-validation, machine-learning, svm

I read on this exchange about a heuristic method for estimating gamma for the RBF kernel in SVMs. I was wondering if someone might be able to explain it to me in a little more detail? I believe you select 1000 (or some other large number of) pairs of data points from the dataset and then calculate the norm of the difference for each pair. Apparently, the inverses of the 0.1 and 0.9 quantiles and of the median of these distances are good candidates for a suitable gamma for the RBF kernel.

Thanks

Best Answer

First of all, there is no reason (except computational cost) not to use your whole dataset. As long as you don't use label information, there is no reason not to use all the information you can get from your data.

Why are quantiles of the distance a good heuristic? The solution of an SVM problem is a linear combination of RBF kernels centered on the support vectors, $\sum_i y_i \alpha_i \exp(-\gamma \|x - x_i\|^2)$. During the learning phase, the optimization adapts the $\alpha_i$ to maximize the margin while retaining correct classification.
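As a concrete illustration, here is a minimal NumPy sketch of that decision function; the support vectors, labels, and coefficients below are made-up placeholders, not values from any fitted model:

```python
import numpy as np

def rbf_decision_function(x, support_vectors, y, alpha, gamma):
    """Evaluate sum_i y_i * alpha_i * exp(-gamma * ||x - x_i||^2)."""
    sq_dists = np.sum((support_vectors - x) ** 2, axis=1)  # ||x - x_i||^2 for every support vector
    return np.dot(y * alpha, np.exp(-gamma * sq_dists))

# Toy values, only to show the call signature.
support_vectors = np.array([[0.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])
print(rbf_decision_function(np.array([0.2, 0.1]), support_vectors, y, alpha, gamma=1.0))
```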

Now, there are two extreme cases for the choice of $\gamma$:

  1. Imagine that $\gamma$ is very small, which means that the RBF kernel is very wide. Let us assume that it is so wide that the kernel is still significantly positive for every data point of the dataset. This will probably give the optimizer a hard job, since changing the value of a single $\alpha_i$ will change the decision function on all datapoints because the kernel is too wide.
  2. The other extreme is when $\gamma$ is large, which means that the RBF kernel is very narrow. When changing the $\alpha_i$ for a datapoint, the decision function of the SVM will basically change only for that datapoint. This means that probably all training vectors will end up as support vectors, which is clearly not desirable (the sketch after this list illustrates both regimes).
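To see these two regimes empirically, one could fit an SVM at a very small and a very large $\gamma$ and compare the number of support vectors. A rough sketch using scikit-learn's SVC (the dataset and the particular gamma values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Arbitrary toy dataset, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for gamma in (1e-6, 1e-1, 1e3):  # very wide, moderate, and very narrow kernel
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print(f"gamma={gamma:g}: {clf.n_support_.sum()} support vectors")
```

With the very large gamma you should see almost every training point retained as a support vector, matching the second extreme above.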

To see why the heuristic is a good choice, one must realize that a given value of $\gamma$ determines a boundary around each support vector within which the RBF kernel is larger than a certain value (like the one-$\sigma$ interval for the Normal distribution). By choosing $\gamma$ according to quantiles of the pairwise distances, you make sure that a certain percentage of the datapoints lies within that boundary. Therefore, if you change the $\alpha_i$ for a datapoint, you will in fact only affect the decision function for a certain percentage of the datapoints, which is what you want. How that percentage should be chosen depends on the learning problem, but you avoid changing the decision function for all datapoints or for only one.
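Putting the heuristic itself into code: compute pairwise distances over the data (or a random subsample of pairs) and take the inverses of chosen quantiles as candidate values of $\gamma$. A minimal sketch follows; note that because the RBF kernel is $\exp(-\gamma \|x - x'\|^2)$, taking quantiles of the *squared* distances is one common convention, while the question describes the variant with plain norms, so treat the metric and quantile choices below as assumptions to adapt:

```python
import numpy as np
from scipy.spatial.distance import pdist

def gamma_candidates(X, quantiles=(0.1, 0.5, 0.9)):
    """Candidate gammas as inverses of quantiles of pairwise squared distances."""
    sq_dists = pdist(X, metric="sqeuclidean")  # all pairwise squared distances
    return {q: 1.0 / np.quantile(sq_dists, q) for q in quantiles}

# Example with random data as a placeholder for your dataset.
X = np.random.default_rng(0).normal(size=(500, 10))
print(gamma_candidates(X))
```

The small quantile gives the largest candidate gamma (narrow kernels) and the large quantile the smallest (wide kernels); cross-validation over these few candidates is the usual way to pick among them.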