Solved – Cross-validation procedure – is this right?

cross-validation, machine-learning, predictive-models, svm

Just want to check that I am performing my cross-validation procedure correctly. I'm using a non-linear SVM. I do a five-fold cross-validation (5 splits of test/train on my original training data) and, for each fold, run a grid search to find the optimal parameters for that train/test pair (i.e. each parameter set is fitted on the training data and then evaluated on the test data). After the five iterations, each parameter set (2 hyper-parameters) has 5 fit measures. I just take the average of those 5 numbers and pick the parameter set with the best average. Does this sound correct? I'm pretty new to this, so I'm not entirely aware of what other methods are out there.
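For reference, here is a minimal sketch of what I'm doing in Python with scikit-learn (the data and the parameter grid below are just placeholders, not my actual setup); GridSearchCV fits each parameter set on the training part of each fold, scores it on the held-out part, and averages the five scores:

```python
# Minimal sketch of the procedure described above (placeholder data and grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_)  # parameter set with the best average fold score
print(search.best_score_)   # mean of the 5 fold scores for that parameter set
```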

Additionally, was wondering a few more things:
1) Any way to make it faster? Cross validation + grid search is pretty computationally intensive.
2) Are there other validation methods besides k-fold CV (stratified, I might add)? My data are a time series, so I was considering a moving-window type of validation but wasn't sure. Any thoughts welcome (a sketch of what I had in mind follows this list).
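Regarding 2), the moving-window idea I was considering is forward-chaining validation, where each fold trains only on past observations and tests on the block that follows. A rough sketch using scikit-learn's TimeSeriesSplit (an assumption; I haven't settled on an implementation), with max_train_size giving a fixed-length moving window rather than an expanding one:

```python
# Sketch of forward-chaining (moving-window) validation for time-series data.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.svm import SVC

X = np.random.randn(300, 4)      # placeholder time-ordered features
y = (X[:, 0] > 0).astype(int)    # placeholder labels

# Each split trains on earlier samples only and tests on the next block;
# max_train_size caps the training window so it slides rather than grows.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=100)
scores = cross_val_score(SVC(kernel="rbf", C=1.0, gamma=0.1), X, y, cv=tscv)
print(scores.mean())
```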

I should add that I'm asking because my training fit (the average over the 5 CV folds) is still much better than the fit on my actual test data. I'm trying to figure out what might be causing this and the best way to reduce the difference. I realize increasing the number of folds may help, though that also makes question 1) worse.

Best Answer

Yes, the approach you are using is correct. As @halilpazarlama suggests, the reason your cross-validation error is lower than the test error is indeed likely to be that you are over-fitting the cross-validation error. Essentially, the cross-validation error is a performance estimate based on a finite sample of data, and thus will have a (usually) non-negligible variance (i.e. if you ran the same experiment again with a different sample of data, you would get a slightly different minimum error, and the minimum may well occur at a different grid point). Thus the cross-validation error can be minimised in two ways: by changes that genuinely improve generalisation performance, and by changes that merely exploit the random sampling of the data used to form the training set. I wrote a paper about this, as it can be problematic if you have a large number of hyper-parameters to tune:

G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation", Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010. (pdf)

If you only have a few hyper-parameters to tune, this isn't usually too much of a problem, as long as you are aware that the minimum cross-validation error will be optimistically biased. If you need an unbiased (or at least less biased or pessimistically biased) estimate, then the thing to do is nested cross-validation, where the outer cross-validation estimates performance and the inner cross-validation is used to tune the hyper-parameters separately in each fold (see the paper for details). Basically tuning the hyper-parameters is part of the fitting of the model and needs to be cross-validated as well. Of course this is computationally even more expensive.
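As an illustration, here is a rough sketch of nested cross-validation in Python with scikit-learn (the library choice and the parameter grid are assumptions, not part of the answer): the inner loop tunes the hyper-parameters and the outer loop scores the whole tune-and-fit procedure.

```python
# Sketch of nested cross-validation (placeholder data and parameter grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner CV: hyper-parameter tuning is treated as part of fitting the model.
tuned_svm = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer CV: estimates the performance of the whole tune-and-fit procedure,
# giving a less optimistically biased figure than the inner CV score.
nested_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(nested_scores.mean())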

To reduce the computational expense, you could try using the Nelder-Mead simplex method to tune the hyper-parameters instead of grid search; it is usually more efficient and doesn't need gradient information. Pattern search is another alternative. Another thing you can do to improve efficiency is to start the model at each grid point from the model found at the previous grid point, instead of starting from scratch again (alpha seeding). Alternatively, you could use a "regularisation path" type algorithm to learn an entire row of the grid (where the regularisation parameter changes with the kernel parameter held fixed) in one go.
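For instance, one way to apply the Nelder-Mead suggestion is to treat the cross-validation error as a function of (log C, log gamma) and hand it to a simplex optimiser. The sketch below uses scipy.optimize and scikit-learn as assumed implementations and searches in log-space so both hyper-parameters stay positive; it is only an illustration of the idea, not a prescription.

```python
# Sketch: Nelder-Mead simplex search over SVM hyper-parameters in log-space.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

def cv_error(log_params):
    """Cross-validation error as a function of (log10 C, log10 gamma)."""
    C, gamma = 10.0 ** log_params
    scores = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5)
    return 1.0 - scores.mean()

# Start from C = 1, gamma = 0.01 (arbitrary initial guess).
result = minimize(cv_error, x0=np.array([0.0, -2.0]), method="Nelder-Mead")
print(10.0 ** result.x)  # tuned (C, gamma)
```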