For the most part I disagree with yukiJ... please read on.
1) The procedure is almost OK. Note that you can report the out-of-sample performance for all models; however, if you then choose the best model based on these numbers, the out-of-sample estimate is again contaminated (you used it to pick the best model family). So I suggest you choose the best model overall using your described procedure, and then use your test set to estimate the performance of that best model. Of course it depends on your interest: if you care about the performance of the best model, take my suggestion; if you are interested in the performance of all model families and do not necessarily care about the best one, then your approach is fine.
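In scikit-learn terms, a minimal sketch of that workflow might look like this (the dataset, candidate estimators, and split sizes are placeholders I chose for illustration, not anything from the question):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)   # placeholder data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    candidates = {"logreg": LogisticRegression(max_iter=1000), "svm_rbf": SVC()}

    # model selection: cross-validate every family on the training data only
    cv_means = {name: cross_val_score(est, X_train, y_train, cv=10).mean()
                for name, est in candidates.items()}
    best_name = max(cv_means, key=cv_means.get)

    # performance estimate: fit the winner on the full training set, score ONCE on the test set
    best_model = candidates[best_name].fit(X_train, y_train)
    print(best_name, best_model.score(X_test, y_test))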
2 & 3) I would use nested cross-validation. What does this mean?
You make folds for your dataset, let's say 10-fold cross-validation; use 5 folds if it takes too long.
Now, for each fold, do the following. You construct a train set and a test set. Since I'm describing the first fold, let's call them train1 and test1. So you have obtained two sets from the complete dataset.
Now you are going to make 10 folds within train1. So essentially you now perform cross-validation on train1 to determine the best model overall, or the best model from each model family. In words:
Use train1, and for the first fold of train1: make two datasets, train1_1 and test1_1 (note that train1_1 is contained in train1, and so is test1_1). Train all models on train1_1 and evaluate them all on test1_1.
Then you go to the second fold, so from train1 you make two new datasets, train1_2 and test1_2 (note that both are contained in train1).
Do the same and evaluate on test1_2.
And so on.
At the end you have evaluated the models 10 times using the dataset train1. Now you find the best performing model of each model family (by averaging all test scores), or simply the best model overall (by averaging all test scores). You have to store which model performs best on train1 ;). Once you have the best performing models, you evaluate them on the still unused test set test1 (remember, test1 is NOT contained in train1).
Now you repeat the same for the second fold of the complete dataset. So you form a dataset train2 and a dataset test2, and you again create the following datasets from train2:
train2_1, test2_1
train2_2, test2_2,
...
train2_10, test2_10,
Repeat the same procedure. You get the best performing models (which may be different models than the ones you found on train1! That is why you need to store which model(s) perform best). Once you have the best performing models, evaluate them on test2.
Repeat this for all remaining folds.
So in the end, you get this 10 times:
fold 1: model X performed best, with out-of-sample performance Z1
fold 2: model A performed best, with out-of-sample performance Z2
...
fold 10: model X performed best, with out-of-sample performance Z10
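A rough sketch of this nested loop, assuming scikit-learn and reusing the `X`, `y`, and `candidates` placeholders from the sketch above (fold counts and seeds are arbitrary):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

    outer_scores, winners = [], []
    for train_idx, test_idx in outer.split(X, y):            # train1/test1, train2/test2, ...
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]

        # inner cross-validation on the training part only, to pick the best model
        inner_means = {name: cross_val_score(est, X_tr, y_tr, cv=inner).mean()
                       for name, est in candidates.items()}
        best_name = max(inner_means, key=inner_means.get)
        winners.append(best_name)                            # store who won on this outer fold

        # evaluate that winner once on the untouched outer test fold
        score = candidates[best_name].fit(X_tr, y_tr).score(X_te, y_te)
        outer_scores.append(score)

    print(list(zip(winners, outer_scores)))                  # fold-by-fold winner and score
    print(np.mean(outer_scores), np.std(outer_scores))       # average performance and its spread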
Using this method, you get the most out of your data. Why?
You can see whether there is large variance: does the optimal model really depend on the fold, or is it always the same? In the first case the model might be overfitting to the fold; in the latter case you should be happy, since the model performs consistently well, indicating that it really does generalize.
You get the variance of the out-of-sample performance of the model.
By reusing the data in this way, you do not contaminate folds, yet you get to use your data multiple times.
Also, choosing the best model this way is more robust, since you use averaged test scores to choose models, so you also avoid overfitting during model selection.
If you get large variance numbers you should of course still watch out. You could make plots of the grid search in the intermediate model-selection steps to see whether model selection makes sense at all, or whether your grid is much too small or much too large.
If the variance numbers look reasonable, then instead of averaging the scores you can also do the following. Let's say you have 2 models and 4 folds:
| model   | fold 1 | fold 2 | fold 3 | fold 4 | avg  |
|---------|--------|--------|--------|--------|------|
| model A | 2      | 3      | 5      | 2      | 3    |
| model B | 2.2    | 3.2    | 3      | 2.2    | 2.65 |
Let's say that for this performance measure, higher is better.
You can see that on average, model A performs better (looking at the average score).
However, for 3 out of 4 folds, model B performs better than model A.
Thus you might say that model B was simply unlucky.
So for choosing the best model, you can also use majority voting on the test scores. Here B would get 3 votes (since it performs best 3 times) and A gets 1 vote, so after this model-selection step you would select model B.
This can be useful if you have high variance, since by averaging you lose a lot of information. Note that model A and model B must use the same folds in each cross-validation step (pay attention that you really do use the same folds!); only then can you directly compare the per-fold numbers.
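A tiny sketch of that voting rule, using the made-up scores from the table above:

    import numpy as np

    # per-fold test scores, same folds for every model (higher is better)
    scores = {"A": [2, 3, 5, 2], "B": [2.2, 3.2, 3, 2.2]}

    means = {m: np.mean(s) for m, s in scores.items()}        # A: 3.0, B: 2.65
    votes = {m: 0 for m in scores}
    for fold in range(4):
        winner = max(scores, key=lambda m: scores[m][fold])   # best model on this fold
        votes[winner] += 1                                    # A: 1 vote, B: 3 votes

    print(means, votes)   # averaging would pick A, majority voting picks B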
To reduce variance, you can repeat the whole procedure above: simply choose new random cross-validation folds and repeat everything. This also gives you a better idea of whether your results are robust.
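If you use scikit-learn, I believe RepeatedStratifiedKFold gives you exactly this kind of repetition (a sketch; the counts and seed are placeholders):

    from sklearn.model_selection import RepeatedStratifiedKFold

    # 10-fold CV repeated 5 times with fresh random splits = 50 outer train/test pairs
    outer = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)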
3) Typically, more folds are better, but they are a big computational burden; that is why people typically use only 5 or 10 folds. Keep in mind that with the procedure described above it will take 100 train/test runs (10 x 10 = 100)!
There is one downside: if you use more folds, the variance of your performance estimate on each test set will increase, since the test set is smaller and you average over fewer samples. On the other hand, with more folds your training set is bigger, and the model will typically perform better (or closer to the performance you would get if you used all samples for training).
In the end, what matters most is your preference. Do you want an accurate estimate of the generalization performance? Use fewer folds and big test sets; your estimate will be closer to the real number. However, the bigger your test set, the worse your model will perform, since you have fewer training samples, so you are not reaching its full potential. So instead, I suggest my method, where you have small test sets, but you average performance over the test sets ;).
If I were you, I would look into how to run your computations in parallel, since it will save you a lot of time. At the very least, be sure to save your results during the process if it takes long, so that a crash does not force you to start from scratch.
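For example, with scikit-learn and joblib something along these lines should work (a sketch; `model`, `X`, `y`, and the file name are placeholders):

    from joblib import dump, load
    from sklearn.model_selection import cross_val_score

    # most scikit-learn CV helpers can parallelize over folds/candidates with n_jobs
    scores = cross_val_score(model, X, y, cv=10, n_jobs=-1)

    # checkpoint intermediate results so a crash does not cost you a full restart
    dump(scores, "cv_scores.joblib")
    # scores = load("cv_scores.joblib")   # resume later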
I did not have a lot of time to write and format my answer, please forgive me. If anything is unclear, let me know and I will get back to you.
PS: for your case I would use a fairly complex model. Assuming classification, I would go for KNN (1 hyperparameter) or an SVM with an RBF kernel (2 hyperparameters). For regression, perhaps looking into robust regression makes sense, but it depends on your performance measure. Say you measure performance using MSE; then I would simply use regular linear regression plus regularization (L2, L1) and perhaps a kernel (polynomial, RBF).
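For reference, a sketch of how these suggestions could map onto scikit-learn estimators (the choice of classes is my own reading of the advice above):

    from sklearn.kernel_ridge import KernelRidge
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # classification: one hyperparameter (k) or two (C, gamma)
    knn = KNeighborsClassifier()        # tune n_neighbors
    svm = SVC(kernel="rbf")             # tune C and gamma

    # regression with MSE: regularized linear regression, optionally kernelized
    ridge = Ridge()                     # L2 penalty, tune alpha
    lasso = Lasso()                     # L1 penalty, tune alpha
    krr = KernelRidge(kernel="rbf")     # kernelized ridge, tune alpha and gamma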
I figured out where my understanding was off, and figured I should answer my own question in case anyone else stumbles upon it.
To start, sklearn makes nested cross-validation deceptively easy. I read their example over and over but never got it until I looked at the extremely helpful pseudocode given in the answer to this question.
Briefly, this is what I had to do (which is almost a copy of the example scikit-learn gives):
- Initialize two cross-validation generators, inner and outer. For this, I used the StratifiedKFold() constructor.
- Create a RandomizedSearchCV object (much quicker than a full grid search; I think one could use sklearn objects to calculate the Bayesian Information Criterion and build an even cooler/faster/smarter hyperparameter optimizer, but that is beyond my knowledge, I only heard Andreas Mueller mention it in a lecture once), giving the inner cross-validator as the cv parameter, and the rest of your stuff (estimators, scoring function, etc.) as usual.
- Fit this to your training set (X) and labels (y). You want to fit it because you'll need a fitted estimator for the next step (i.e., the estimator you get after transforming X and y with the estimators in your pipeline and then fitting X and y with the final estimator).
- Use cross_val_score and give it your newly fitted RandomizedSearchCV object, X, y, and the outer cross-validator. I assigned the output of this to a variable called scores and returned a tuple consisting of a tuple with the best score and best parameters given by the randomized search (rs.best_score_, rs.best_params_) and the scores variable. I'm a little fuzzy on what exactly I needed and got a bit lazy, so this might be more information than necessary.
In code, this is kind of how it looks:
    from sklearn.model_selection import RandomizedSearchCV, cross_val_score

    def nestedCrossValidation(X, y, pipe, param_dist, scoring, outer, inner):
        # inner loop: randomized hyperparameter search, cross-validated with `inner`
        rs = RandomizedSearchCV(pipe, param_dist, verbose=1, scoring=scoring, cv=inner)
        rs.fit(X, y)
        # outer loop: unbiased estimate of the whole tuning procedure
        scores = cross_val_score(rs, X, y, cv=outer)
        return ((rs.best_score_, rs.best_params_), scores)
cross_val_score will split the data into a training/test set and do a randomized search on that training set (which itself gets split into training/test sets), generate the scores, then go back up to cross_val_score to test and move on to the next training/test split.
AFTER you do this, you'll get a bunch of cross-validation scores. My original question was: "what do you get/do now?" Nested cross-validation is not for model selection. What I mean by that is that you're not trying to get parameter values that are good for your final model; that's what the inner RandomizedSearchCV is for.
But of course, if you are using something like a RandomForest for feature selection in your pipeline, then you'd expect a different set of parameters each time! So what do you really get that's useful?
Nested cross-validation is to give an unbiased estimate as to how good your methodology/series of steps is. What is "good"? Good is defined by the stability of hyperparameters and the cross-validation scores you ultimately get. Say you get numbers like I did: I got cross-validation scores of: [0.57027027, 0.48918919, 0.37297297, 0.74444444, 0.53703704]. So depending on the mood of my method of doing things, I can get an ROC score between 0.37 and 0.74 — obviously this is undesirable. If you were to look at my hyper-parameters, you'd see that the "optimal" hyper-parameters vary wildly. Whereas if I got consistent cross-validation scores that were high, and the optimal hyper-parameters were all in the same ballpark, I can be fairly confident that the way I am choosing to select features and model my data is pretty good.
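A quick way to quantify that spread from the outer scores (a small sketch using the numbers above):

    import numpy as np

    scores = np.array([0.57027027, 0.48918919, 0.37297297, 0.74444444, 0.53703704])
    print(scores.mean(), scores.std())   # roughly 0.54 +/- 0.12, i.e. quite unstable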
If you have instability, I am not sure what you can do. I'm still new to this; the gurus on this board probably have better advice than blindly changing your methodology.
But if you have stability, what's next? This is another important aspect that I neglected to understand: the really good, predictive, and generalizable model created from your training data is NOT the final model, but it's close. The final model uses all of your data, because you're done testing, optimizing, and tweaking. (Yes, if you tried to cross-validate a model on data you used to fit it, you'd get a biased result, but why would you cross-validate it at this point? You've already done that, and hopefully a bias issue doesn't exist.) You give the model all the data you can so it makes the most informed decisions possible, and the next time you see how well it does is when it's in the wild, on data that neither you nor the model has ever seen before.
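In code, that final refit might look like this (a sketch; `rs`, `X`, and `y` are the search object and the full dataset from the snippet above):

    # refit the whole tuning procedure on ALL the data; the nested CV above already
    # told us how well this procedure generalizes, so no further testing happens here
    rs.fit(X, y)
    final_model = rs.best_estimator_   # this is what you deploy / use "in the wild"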
I hope this helps someone. For some reason it took me a really long time to wrap my head around this, and here are some other links I used to understand:
http://www.pnas.org/content/99/10/6562.full.pdf — A paper that re-examines data and conclusions drawn by other genetics papers that don't use nested cross-validation for feature selection/hyper-parameter selection. It's somewhat comforting to know that even super smart and accomplished people also get swindled by statistics from time to time.
http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf — iirc, I've seen an author of this paper answer a ton of questions about this topic on this forum
Training with the full dataset after cross-validation? — One of the aforementioned authors answering a similar question in a more colloquial manner.
http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html — the sklearn example
It is recommended to hold out a test set that the model only sees at the end, but not during the parameter tuning and model selection steps.
Grid search with cross-validation is especially useful to perform these steps, which is why the author only uses the training data.
If you use your whole data for this step, you will have picked a model and a parameter set that work best for the whole data, including the test set. Hence, this is prone to overfitting.
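A minimal sketch of that recommendation (the estimator and grid are placeholder choices, and `X`, `y` stand for the full dataset):

    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # tuning and model selection happen on the training data only
    gs = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
    gs.fit(X_train, y_train)

    # the held-out test set is touched exactly once, at the very end
    print(gs.score(X_test, y_test))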
Usually it is recommended to either: