Grid search is a sensible procedure as @JohnSmith suggests, however it is not the only stable technique. I generally use the Nelder-Mead simplex algorithm, which I have found to be very reliable and more efficient than grid search, as less time is spent investigating areas of hyper-parameter space that give poor models. If you are a MATLAB user, you can get my implementation of this method here. Nelder-Mead simplex methods are attractive as they don't require gradient information (I suppose you can think of the simplex itself as providing a rough approximation of the local gradient) and are very easily implemented.
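For example, here is a minimal sketch of the idea in Python (not the MATLAB implementation mentioned above), using scipy.optimize.minimize with the Nelder-Mead method to minimise the cross-validation error of an RBF SVM over log-transformed values of C and $\gamma$; the toy dataset and starting point are placeholders:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy dataset purely for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def cv_error(log_params):
    # Search in log space, so the simplex takes sensible steps for
    # hyper-parameters that vary over orders of magnitude.
    C, gamma = np.exp(log_params)
    acc = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
    return 1.0 - acc  # minimise the cross-validation error rate

# Nelder-Mead needs no gradients, only function evaluations.
result = minimize(cv_error, x0=np.log([1.0, 0.1]), method="Nelder-Mead")
best_C, best_gamma = np.exp(result.x)
print(best_C, best_gamma)
```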
Gradient descent optimisation of the span bound is also a good way of optimising the hyper-parameters; see Chapelle et al. (and investigate the other papers that Olivier has written, there are some real gems).
One caveat, however, is that it is quite easy to over-fit the cross-validation error (or span bound etc.) when optimising the hyper-parameters, especially if there is little data and many hyper-parameters. This means you can get a very poor classifier if you tune the hyper-parameters to convergence, so a coarse grid search, which cannot tune them too precisely, can often perform better (I have a paper in preparation on this).
My paper in JMLR addresses this exact question, and demonstrates why the procedure suggested in the question (or at least one very like it) results in optimistically biased performance estimates:
Gavin C. Cawley, Nicola L. C. Talbot, "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation", Journal of Machine Learning Research, 11(Jul):2079-2107, 2010.
The key thing to remember is that cross-validation is a technique for estimating the generalisation performance of a method for generating a model, rather than of the model itself. So if choosing kernel parameters is part of the process of generating the model, you need to cross-validate the model selection process as well; otherwise you will end up with an optimistically biased performance estimate (as will happen with the procedure you propose).
Assume you have a function fit_model, which takes in a dataset consisting of attributes X and desired responses Y, and which returns the fitted model for that dataset, including the tuning of hyper-parameters (in this case kernel and regularisation parameters). This tuning of hyper-parameters can be performed in many ways, for example minimising the cross-validation error over X and Y.
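For concreteness, here is a hedged sketch of what such a fit_model might look like, with the hyper-parameters tuned by minimising the cross-validation error via a grid search; the parameter grid and the use of scikit-learn's GridSearchCV and an RBF SVM are illustrative choices, not part of the recipe itself:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_model(X, Y):
    # Hyper-parameters tuned by minimising the cross-validation error over X
    # and Y; the grid and the choice of an RBF SVM are illustrative only.
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, Y)
    # GridSearchCV refits the winning settings on all of X and Y by default.
    return search.best_estimator_, search.best_params_
```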
Step 1 - Fit the model to all available data, using the function fit_model. This gives you the model that you will use in operation or deployment.
Step 2 - Performance evaluation. Perform repeated cross-validation using all available data. In each fold, the data are partitioned into a training set and a test set. Fit the model using the training set (record hyper-parameter values for the fitted model) and evaluate performance on the test set. Use the mean over all of the test sets as a performance estimate (and perhaps look at the spread of values as well).
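Continuing the hypothetical fit_model sketch above, step 2 might look something like this, with the whole model-selection procedure re-run inside every outer fold and the chosen hyper-parameter values recorded alongside the test-set scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

# Placeholder dataset; fit_model is the sketch from above.
X, Y = make_classification(n_samples=300, n_features=10, random_state=0)

outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores, chosen_params = [], []
for train_idx, test_idx in outer.split(X, Y):
    # The entire model-selection procedure is repeated on the training fold only.
    model, params = fit_model(X[train_idx], Y[train_idx])
    scores.append(model.score(X[test_idx], Y[test_idx]))
    chosen_params.append(params)  # kept for step 3

print(np.mean(scores), np.std(scores))  # performance estimate and its spread
```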
Step 3 - Variability of hyper-parameter settings - perform an analysis of the hyper-parameter values collected in step 2. However, I should point out that there is nothing special about hyper-parameters, they are just parameters of the model that have been estimated (indirectly) from the data. They are treated as hyper-parameters rather than parameters for computational/mathematical convenience, but this doesn't have to be the case.
The problem with using cross-validation here is that the training and test data are not independent samples (as they share data) which means that the estimate of the variance of the performance estimate and of the hyper-parameters is likely to be biased (i.e. smaller than it would be for genuinely independent samples of data in each fold). Rather than repeated cross-validation, I would probably use bootstrapping instead and bag the resulting models if this was computationally feasible.
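If it were computationally feasible, a rough sketch of that bootstrap-and-bag alternative (again reusing the hypothetical X, Y and fit_model from the sketches above) might look like this, with each model evaluated on its out-of-bag points and the fitted models kept so their predictions can be bagged:

```python
import numpy as np

rng = np.random.RandomState(0)
n = len(X)
oob_scores, bagged_models = [], []
for _ in range(50):
    boot = rng.randint(0, n, size=n)          # bootstrap resample (with replacement)
    oob = np.setdiff1d(np.arange(n), boot)    # out-of-bag points, kept for evaluation
    model, _ = fit_model(X[boot], Y[boot])    # full model selection on the resample
    oob_scores.append(model.score(X[oob], Y[oob]))
    bagged_models.append(model)               # predictions of these can be averaged/voted

print(np.mean(oob_scores))  # out-of-bag performance estimate
```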
The key point is that to get an unbiased performance estimate, whatever procedure you use to generate the final model (fit_model) must be repeated in its entirety independently in each fold of the cross-validation procedure.
Best Answer
First of all, there is no reason (except computational cost) not to use your whole dataset. As long as you don't use label information, there is no reason not to use all the information you can get from your data.
Why are quantiles of the distance a good heuristic? The solution of an SVM problem is a linear combination of the RBF kernels that sit on the support vectors $\sum_i y_i \alpha_i \exp(-\gamma ||x - x_i||^2)$. During the learning phase, the optimization adapts the $\alpha_i$ to maximize the margin while retaining correct classification.
Now, there are two extreme cases for the choice of $\gamma$:

- If $\gamma$ is very small, the kernel is close to 1 for every pair of datapoints, so the decision function is almost constant over the data and the model underfits.
- If $\gamma$ is very large, the kernel is essentially zero everywhere except in the immediate vicinity of each support vector, so each training point only influences its own neighbourhood and the model overfits.
To see that the heuristic is a good choice, one must realize that a given value of $\gamma$ determines a boundary around each datapoint within which the RBF kernel is larger than a certain value (like the one-$\sigma$ interval for the Normal distribution). By choosing $\gamma$ according to quantiles of the pairwise distances, you make sure that a certain percentage of the datapoints lies within that boundary. Therefore, if you change the $\alpha_i$ for a datapoint, you will in fact only affect the decision function for a certain percentage of datapoints, which is what you want. How that percentage should be chosen depends on the learning problem, but you avoid changing the decision function for all datapoints or only one.
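As a rough illustration of the heuristic (the function name and the particular quantiles are just examples), one could set $\gamma$ from a quantile of the pairwise squared distances, for instance the median:

```python
import numpy as np
from scipy.spatial.distance import pdist

def gamma_from_quantile(X, q=0.5):
    # Pairwise squared Euclidean distances between all datapoints.
    sq_dists = pdist(X, metric="sqeuclidean")
    # gamma = 1 / (q-quantile of ||x - x'||^2): roughly a fraction q of the
    # point pairs then lies where the kernel is still >= exp(-1).
    return 1.0 / np.quantile(sq_dists, q)

X = np.random.RandomState(0).randn(100, 5)
print(gamma_from_quantile(X, q=0.1), gamma_from_quantile(X, q=0.9))
```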