Solved – SVM parameter dependence on number of samples

machine-learning, svm

I need to do a grid search to optimize the SVM parameters gamma, C, and epsilon (svm from the e1071 R package). The problem is that I have a fairly large data set: about 100,000 rows and 40 variables.

I have concluded that I can probably survive a grid search with cross-validation on a 40,000-sample subset. But can parameters optimized on a subset of 40,000 rows be used for the final model trained on all 100,000 rows, or do the parameters depend on sample size, and if so, how?
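For concreteness, here is roughly the kind of subset search I have in mind, using tune.svm from e1071. Since I can't paste my real data, the data frame below is a small synthetic stand-in and the grids are placeholders:

```r
library(e1071)

## Small synthetic stand-in for the real 100000 x 40 data set.
set.seed(1)
full_data   <- data.frame(matrix(rnorm(2000 * 5), ncol = 5))
full_data$y <- rowSums(full_data[, 1:5]) + rnorm(2000, sd = 0.2)

## Tuning subset (stands in for the 40000-row subset).
sub_idx <- sample(nrow(full_data), 800)

## Grid search with cross-validation over gamma, cost (C) and epsilon;
## a numeric response makes svm() default to eps-regression.
tuned <- tune.svm(
  y ~ ., data = full_data[sub_idx, ],
  gamma   = 2^(-4:-2),
  cost    = 2^(0:3),
  epsilon = c(0.1, 0.2),
  tunecontrol = tune.control(sampling = "cross", cross = 5)
)
tuned$best.parameters
```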

Are any of the three parameters independent of the other two, or at least stable within some range? For example, let's say I hold epsilon constant and optimize only gamma and C (I know that C and epsilon influence the complexity of the model in different ways), then take the best gamma from that search, keep it constant, and tune the remaining two parameters. Is this likely to give me a good result, or do all three parameters depend strongly on each other?
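If the staged route is reasonable, I picture it roughly like this (continuing from the sketch above, again with placeholder grids): first hold epsilon fixed and tune gamma and cost, then fix the best gamma and tune cost together with epsilon.

```r
## Stage 1: fix epsilon, tune gamma and cost.
stage1 <- tune.svm(y ~ ., data = full_data[sub_idx, ],
                   gamma = 2^(-4:-2), cost = 2^(0:3), epsilon = 0.1)
best_gamma <- stage1$best.parameters$gamma

## Stage 2: fix the stage-1 gamma, tune cost and epsilon.
stage2 <- tune.svm(y ~ ., data = full_data[sub_idx, ],
                   gamma = best_gamma, cost = 2^(0:3),
                   epsilon = c(0.05, 0.1, 0.2))
stage2$best.parameters
```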

Best Answer

In most SVM implementations, $C$ multiplies a sum of slack variables taken over all training patterns, so the data-misfit term grows roughly linearly with the number of training patterns and the appropriate $C$ tends to shrink as the training set grows. If you perform the parameter search on a subset and then train the final model on the full data, you will therefore need to multiply the $C$ found on the subset by a factor of approximately $N_s/N_t$, where $N_t$ is the number of patterns used to train the final model and $N_s$ is the subset size used for the parameter optimization. This is only an approximation, though, so it may not give as good an estimate as performing model selection with the full training set.
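As a minimal sketch of the rescaling in R (the cost value below is a placeholder, not a recommendation):

```r
## Heuristic rescaling of C tuned on a subset before the final fit.
N_s <- 40000                       # rows used for the parameter search
N_t <- 100000                      # rows used to train the final model
C_subset <- 8                      # cost selected by the subset search (placeholder)
C_full   <- C_subset * N_s / N_t   # 8 * 0.4 = 3.2: smaller C for the larger set

## C_full would then be passed as cost = C_full to svm(), together with the
## tuned gamma and epsilon, when fitting on all N_t rows.
C_full
```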