Solved – Grid search with cross validation: should I keep a separate test set

cross-validation, model selection, supervised learning

There is a lot of information on using cross validation and grid search, and there is also confusion about the test set in this situation.

I have a labeled data set that I am using to build a predictive model. I will perform a model selection across a family of models (SVC, NB, random forests, boosting etc.) and then do parameter tuning using grid search.

However, I could not find a definitive answer about whether I need to keep a separate test set for estimating out-of-sample performance.

This is my intended plan:

  1. set aside the test set (and do not touch it until the very end, to avoid data contamination)
  2. on the remaining data: perform grid search for all models and estimate their performance using k-fold cross validation
  3. select the best model from each family of models and test their out-of-sample performance using the test set
  4. report results
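In scikit-learn terms, this is roughly what I have in mind (a rough sketch only; the models, grids, and synthetic data below are just placeholders for my real data and candidates):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=550, n_features=30, random_state=0)  # stand-in for my data

    # Step 1: set aside the test set and do not touch it until the very end.
    X_work, X_test, y_work, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Step 2: grid search with k-fold cross validation for each model family (placeholder grids).
    searches = {
        "svc": GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=10),
        "rf": GridSearchCV(RandomForestClassifier(random_state=0),
                           {"n_estimators": [100, 300]}, cv=10),
    }
    for name, search in searches.items():
        search.fit(X_work, y_work)

    # Step 3: test the tuned model from each family once on the held-out test set.
    for name, search in searches.items():
        print(name, search.best_params_, search.score(X_test, y_test))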

However, there are several challenges that I am not sure how to deal with:

  • The data set is rather small: around 550 points with 30 variables
  • The data set is noisy, and I was told that I could get significantly different results for the exact same model with the same hyper-parameters on different random splits of the data, which is a big challenge

Given these two challenges, I have three questions:

  1. Does my 4-step procedure seem right?
  2. How do I deal with the second challenge of inconsistent performance?
  3. Due to the small size of the data, what should be the size of the test set, and how many folds should I use in the cross-validation step?

Best Answer

For the most part I disagree with yukiJ... please read on.

1) Your procedure is almost OK. However, note that while you can report the out-of-sample performance for all models, if you then choose the best model based on those numbers, the out-of-sample estimate is contaminated again (you have used it to learn the best model family). So I suggest you choose the best model overall using your described procedure, and then use your test set only to estimate the performance of that best model. Of course it depends on your interest: if you care about the performance of the best model, take my suggestion; if you are interested in the performance of all model families and do not necessarily care about picking the best one, then your approach is fine as is.

2 & 3) I would use nested cross-validation. What does this mean?

You make folds for your dataset; let's say 10-fold cross-validation (use 5 if it takes too long).

Now, for each fold, do the following. You construct a training set and a test set; since I am describing the first fold, let's call them train1 and test1. So you have obtained two sets from the complete dataset.

Now you are going to make 10 folds within the dataset train1. Essentially, you perform cross-validation on train1 to determine the best model overall, or the best model from each model family. In words:

Take train1, and for its first fold make two datasets, train1_1 and test1_1 (note that train1_1 is contained in train1, and so is test1_1). Train all models on train1_1 and evaluate them all on test1_1.

Then you go to the second fold of train1, so from train1 you make two new datasets, train1_2 and test1_2 (again, both contained in train1).

Do the same: train on train1_2 and evaluate on test1_2.

And so on for the remaining folds of train1.

At the end you have evaluated each model 10 times using the dataset train1. Now you find the best-performing model from each model family (by averaging its test scores), or simply the best model overall (again by averaging test scores). You have to store which model performs best on train1 ;). Once you have the best-performing model(s), you evaluate them on the still-unused test set test1 (remember, test1 is NOT contained in train1).

Now you repeat the same for the second fold of the complete dataset: you form datasets train2 and test2, and again create the following datasets from train2:

train2_1, test2_1
train2_2, test2_2
...
train2_10, test2_10

Repeat the same procedure. You get the best-performing models (which may be different models than the ones you found on train1! That is why you need to store which model(s) perform best). Once you have the best-performing models, evaluate them on test2.

And so on: repeat this for every fold of the complete dataset.

So in the end, you get something like this, 10 times:

fold 1: model X performed best with out-of-sample performance Z1
fold 2: model A performed best with out-of-sample performance Z2
...
fold 10: model X performed best with out-of-sample performance Z10
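If it helps, here is a rough sketch of this nested procedure in scikit-learn. The model families, grids, and synthetic data are just placeholders, and I keep the single best model per outer fold:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=550, n_features=30, random_state=0)  # stand-in data

    # Candidate model families with (placeholder) hyper-parameter grids.
    families = {
        "svc_rbf": (SVC(kernel="rbf"), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}),
        "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 9, 15]}),
    }

    outer = KFold(n_splits=10, shuffle=True, random_state=0)
    results = []  # (fold number, best family, score on the outer test set)

    for fold, (train_idx, test_idx) in enumerate(outer.split(X), start=1):
        X_train, y_train = X[train_idx], y[train_idx]  # "train1", "train2", ...
        X_test, y_test = X[test_idx], y[test_idx]      # "test1", "test2", ... untouched during tuning

        # Inner 10-fold CV on the training part of this outer fold: tune each family,
        # then keep the overall winner (by averaged inner test score).
        best_name, best_model, best_cv_score = None, None, float("-inf")
        for name, (model, grid) in families.items():
            search = GridSearchCV(model, grid, cv=10)
            search.fit(X_train, y_train)
            if search.best_score_ > best_cv_score:
                best_name, best_model, best_cv_score = name, search.best_estimator_, search.best_score_

        # Evaluate this fold's winner on the still-unused outer test set.
        results.append((fold, best_name, best_model.score(X_test, y_test)))

    for fold, name, score in results:
        print(f"fold {fold}: model {name} performed best, out-of-sample score {score:.3f}")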

Using this method, you get the most out of your data. Why?

  • You see whether there is large variance: does the optimal model really depend on the fold, or is it always the same? In the first case the model might be overfitting to the fold; in the latter case you should be happy, since the model is performing consistently well, indicating that it really does generalize.

  • You get the variance of the out-of-sample performance of the model.

  • By reusing the data in this way, you do not contaminate any fold, yet you get to use your data multiple times.

  • Also, choosing the best model this way is more robust, since you use averaged test scores to choose models, which also helps avoid overfitting during model selection.

If you get large variance numbers, you should of course still watch out. You can make plots of the grid search in the intermediate model-selection steps to check whether model selection makes sense at all, or whether your grid is far too small or far too large. If the variance is high, then instead of averaging the scores you can also do the following. Let's say you have 2 models and 4 folds:

          fold 1   fold 2   fold 3   fold 4   avg
model A   2        3        5        2        3
model B   2.2      3.2      3        2.2      2.65

Let's say that for this performance measure, higher is better. On average, model A performs better (looking at the average score). However, in 3 out of 4 folds model B performs better than model A, so you might say that model B was simply unlucky. So for choosing the best model, you can also use majority voting on the per-fold test scores: here B gets 3 votes (since it performs best 3 times) and A gets 1 vote, so after this model-selection step you would select model B. This can be useful if you have high variance, because by averaging you lose a lot of information. Note that model A and model B must use the same folds in each cross-validation step (pay attention that you really do use the same folds!); only then can you directly compare the per-fold numbers (fold 1, fold 2, ...).
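A tiny sketch of this voting rule in Python, using the scores from the table above:

    # Per-fold test scores from the table above (higher is better).
    scores = {
        "A": [2.0, 3.0, 5.0, 2.0],
        "B": [2.2, 3.2, 3.0, 2.2],
    }

    # Averaging picks model A ...
    averages = {m: sum(s) / len(s) for m, s in scores.items()}

    # ... but majority voting over folds picks model B: count how often each model wins a fold.
    votes = {m: 0 for m in scores}
    for fold_scores in zip(*scores.values()):
        winner = max(zip(scores, fold_scores), key=lambda pair: pair[1])[0]
        votes[winner] += 1

    print(averages)  # {'A': 3.0, 'B': 2.65}
    print(votes)     # {'A': 1, 'B': 3}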

To reduce variance, you can repeat the whole procedure above: simply choose new random cross-validation folds and run everything again. That also gives you a better idea of whether or not your results are robust.
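In scikit-learn, one way to do this (just a sketch, under the same assumptions as the code above) is to swap the outer KFold for RepeatedKFold, which redraws the whole set of folds several times with different random splits:

    from sklearn.model_selection import RepeatedKFold

    # Same idea as the outer 10-fold split above, but the folds are redrawn 3 times,
    # so the nested loop now runs over 30 outer train/test pairs instead of 10.
    outer = RepeatedKFold(n_splits=10, n_repeats=3, random_state=0)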

3) Typically, more folds are better, but they are a big computational burden, which is why people usually use only 5 or 10 folds. Keep in mind that with the procedure described above you will fit on 100 train/test pairs (10 x 10 = 100)! There is one downside: if you use more folds, the variance of your performance estimate on each test set will increase, since the test set is smaller and you average over fewer samples. On the other hand, with more folds your training set is bigger, so the model will typically perform better (closer to the performance you would get if you used all samples for training).

In the end, what matters most is your preference. Do you want an accurate estimate of the generalization performance? Use fewer folds and big test sets; your estimate will be closer to the real number. However, the bigger your test set, the worse your model will perform, since you have fewer training samples, so you are not reaching your full potential. That is why I suggest my method, where you have small test sets but average the performance across test sets ;).

If I were you, I would look into how to run your computations in parallel, since it will save you a lot of time. At the very least, save your results during the process if it takes long, so that a crash does not force you to start from scratch.
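In scikit-learn/joblib terms, two small things help here (a sketch; the file name is just a placeholder): GridSearchCV can spread its fits over all cores via n_jobs, and you can dump intermediate results to disk along the way.

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=550, n_features=30, random_state=0)  # stand-in data

    # n_jobs=-1 spreads the cross-validated grid-search fits over all available cores.
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}, cv=10, n_jobs=-1)
    search.fit(X, y)

    # Periodically dump whatever you have so far, so a crash does not cost you everything.
    joblib.dump(search.cv_results_, "gridsearch_checkpoint.pkl")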

I did not have a lot of time to write and format my answer, so please forgive me. If anything is unclear, let me know and I will get back to you.

PS: For your case I would not use a very complex model. Assuming classification, I would go for KNN (1 hyper-parameter) or an SVM with an RBF kernel (2 hyper-parameters). For regression, perhaps looking into robust regression makes sense, but it depends on your performance measure; if you measure performance with MSE, I would simply use regular linear regression plus regularization (L2, L1) and perhaps a kernel (polynomial, RBF).
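Just to illustrate how small the search spaces stay (these grids are my own placeholders, not prescriptions), the suggested models need only one or two hyper-parameters each:

    from sklearn.linear_model import Lasso, Ridge
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Classification: one parameter for KNN, two for the RBF SVM.
    classification_grids = {
        "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 9, 15, 25]}),
        "svm_rbf": (SVC(kernel="rbf"), {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}),
    }

    # Regression with MSE: linear regression plus L2 (Ridge) or L1 (Lasso) regularization.
    regression_grids = {
        "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1, 10]}),
        "lasso": (Lasso(), {"alpha": [0.01, 0.1, 1, 10]}),
    }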
