Solved – Using genetic algorithm for hyperparameter optimization

genetic algorithms, hyperparameter, machine learning

In machine learning, I've learned that one way to optimize a model's hyperparameters is grid search, which evaluates the model at evenly spaced hyperparameter values and picks the combination that gives the best results on the validation set.

Since the mapping from hyperparameters to model performance can have multiple local optima, would it make sense to use a metaheuristic search method, such as a genetic algorithm?

Our gene could be a binary sequence representing hyperparameter values, and an individual's fitness could be the score of the model trained with the hyperparameters encoded in its genetic material.
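To make the encoding idea concrete, here is a minimal sketch of decoding a bit string into hyperparameter values. The hyperparameter names, bit widths, and ranges in SPEC are assumptions chosen for illustration, not something from a particular library. The fitness of an individual would then be the validation score of a model trained with the decoded values.

```python
def decode(gene, low, high):
    """Map a bit string such as '1010' linearly onto the interval [low, high]."""
    as_int = int(gene, 2)
    max_int = 2 ** len(gene) - 1
    return low + (high - low) * as_int / max_int


def decode_individual(chromosome, spec):
    """Slice one chromosome into genes and decode each gene.

    spec is a list of (name, n_bits, low, high) tuples (hypothetical layout).
    """
    params, pos = {}, 0
    for name, n_bits, low, high in spec:
        params[name] = decode(chromosome[pos:pos + n_bits], low, high)
        pos += n_bits
    return params


# Assumed example: two hyperparameters, 4 bits each.
SPEC = [
    ("log10_learning_rate", 4, -4.0, -1.0),
    ("dropout", 4, 0.0, 0.75),
]

params = decode_individual("00001111", SPEC)
print(params)
```

With 4 bits per gene, each hyperparameter can take only 16 distinct values, so the bit width controls the resolution of the search.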

The major flaw that comes to mind is that the model has to be trained over and over again, which would take a lot of time. Is there any way to compensate for that by running fewer training iterations? Would that affect the result a lot? Also, grid search has to train the model for every hyperparameter combination too, so maybe the time difference between the two wouldn't be that big?

What do you think?

Best Answer

You can use genetic algorithms. Yes, they require rerunning experiments again and again, but that is also true of other hyperparameter optimization methods. You can try warm starts, i.e., instead of training each model from scratch, initialize it from a previously found solution. This is sometimes done for deep neural networks when searching over network architectures.

Genetic algorithms can potentially be slow compared to other methods. However, they are relatively easy to adapt to any search space. The first GAs for hyperparameter tuning appeared about 30 years ago. For more recent work, see "Large-Scale Evolution of Image Classifiers" by Real et al., 2017, at https://arxiv.org/abs/1703.01041
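As a rough illustration of how a GA adapts to a mixed search space, here is a minimal sketch with truncation selection, uniform crossover, and random-reset mutation. The search space and toy_fitness are hypothetical stand-ins: in a real run, the fitness function would train the model (possibly warm-started) and return its validation score, and that is where nearly all of the time would go.

```python
import random

# Hypothetical search space mixing numeric and categorical hyperparameters.
SEARCH_SPACE = {
    "max_depth": [2, 4, 6, 8, 10],
    "learning_rate": [0.001, 0.01, 0.1, 0.3],
    "booster": ["tree", "linear"],
}


def toy_fitness(params):
    """Stand-in for 'train the model and return its validation score'."""
    score = -abs(params["max_depth"] - 6) / 10
    score -= abs(params["learning_rate"] - 0.1)
    score += 0.1 if params["booster"] == "tree" else 0.0
    return score


def random_individual(rng):
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    # Uniform crossover: each hyperparameter comes from one parent at random.
    return {name: rng.choice([a[name], b[name]]) for name in SEARCH_SPACE}


def mutate(individual, rng, rate=0.2):
    # Random-reset mutation: occasionally resample a hyperparameter.
    return {
        name: rng.choice(SEARCH_SPACE[name]) if rng.random() < rate else value
        for name, value in individual.items()
    }


def evolve(fitness, generations=20, pop_size=12, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children  # parents survive (elitism)
    return max(population, key=fitness)


best = evolve(toy_fitness)
print(best, toy_fitness(best))
```

Because each fitness call would be a full training run in practice, caching scores for already-evaluated individuals is usually worth adding.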

Related Solutions

Solved – Grid Search for hyperparameter and feature selection

The most important downside of searching along single parameters instead of optimizing them all together is that you ignore interactions. It is quite common that, e.g., more than one parameter influences model complexity. In that case, you need to look at the interactions in order to successfully optimize the hyperparameters.

Depending on how large your data set is and how many models you compare, optimization strategies that return the maximum observed performance run into trouble (true for both grid search and your strategy). The reason is that searching through a large number of performance estimates for the maximum "skims" the variance of the performance estimate: you may just end up with a model and train/test split combination that accidentally happens to look good. Even worse, you may get several perfect looking combinations, and the optimization then cannot know which model to choose and thus becomes unstable.
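The "skimming the variance" effect can be illustrated with a small simulation. This is a sketch under simplifying assumptions: every candidate model has the same true score, and each performance estimate is that score plus independent Gaussian noise. The specific numbers (true score 0.70, noise standard deviation 0.05) are made up for illustration.

```python
import random
from statistics import mean


def best_observed_score(n_models, true_score, noise_sd, rng):
    """All models are equally good; return the best noisy estimate among them."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_models))


rng = random.Random(0)
TRUE_SCORE, NOISE_SD = 0.70, 0.05  # assumed values for illustration

avg_best = {}
for n_models in (1, 10, 100, 1000):
    # Average the "winning" score over 200 repeated searches.
    avg_best[n_models] = mean(
        best_observed_score(n_models, TRUE_SCORE, NOISE_SD, rng) for _ in range(200)
    )
    print(n_models, round(avg_best[n_models], 3))
```

The more models are compared, the further the winning estimate drifts above the true score, even though no model is actually better than any other.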

Solved – Nested Cross-Validation for Feature Selection and Hyperparameter Optimization

I figured out where my understanding was off, figured I should answer my question in case anyone else stumbles upon it.

To start, sklearn makes nested cross-validation deceptively easy. I read their example over and over but never got it until I looked at the extremely helpful pseudocode given in the answer to this question.

Briefly, this is what I had to do (which is almost a copy of the example scikit-learn gives):

1. Initialize two cross-validation generators, inner and outer. For this, I used the StratifiedKFold() constructor.
2. Create a RandomizedSearchCV object, giving the inner cross-validator as the cv parameter and the rest of your stuff (estimators, scoring function, etc.) as normal. (This is so much quicker than the whole grid search. I think one can easily use sklearn objects to calculate the Bayesian Information Criterion and make an even cooler/faster/smarter hyperparameter optimizer, but this is beyond my knowledge; I just heard Andreas Mueller talk about it in some lecture once.)
3. Fit this to your training set (X) and labels (y). This fit is what populates best_params_ and best_score_ on the search object. Note that cross_val_score in the next step clones the search object and refits it on each outer training fold, so the outer scores do not depend on this fit.
4. Use cross_val_score and give it your RandomizedSearchCV object, X, y, and the outer cross-validator. I assigned the output to a variable called scores and returned a tuple consisting of a tuple with the best score and best parameters given by the randomized search (rs.best_score_, rs.best_params_) and the scores variable. I'm a little fuzzy on what exactly I needed and got a bit lazy, so this might be more information returned than necessary.

In code, this is kind of how it looks:

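from sklearn.model_selection import RandomizedSearchCV, cross_val_score

def nestedCrossValidation(X, y, pipe, param_dist, scoring, outer, inner):
    # Inner loop: randomized hyperparameter search, cross-validated with `inner`.
    rs = RandomizedSearchCV(pipe, param_dist, verbose=1, scoring=scoring, cv=inner)
    rs.fit(X, y)  # populates rs.best_score_ and rs.best_params_
    # Outer loop: cross_val_score clones rs and reruns the search on each outer training fold.
    scores = cross_val_score(rs, X, y, cv=outer)
    return ((rs.best_score_, rs.best_params_), scores)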

cross_val_score will split the data into a training/test set and run a randomized search on that training set, which is itself split into training/test sets to generate the inner scores. Control then returns to cross_val_score, which scores the refitted best model on the outer test set and moves on to the next training/test split.
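The control flow described above can be sketched without scikit-learn, which may make the two loops easier to see. This is only an illustration under assumptions: the "model" is a toy 1-D k-nearest-neighbour regressor, the hyperparameter is k, the data are synthetic, and all helper names are hypothetical.

```python
import random
from statistics import mean


def kfold(indices, k):
    """Yield k (train, test) index splits of the given indices."""
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test


def knn_predict(x_train, y_train, x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return mean(y_train[i] for i in nearest)


def neg_mse(xs, ys, train, test, k):
    """Fit on `train`, score on `test`; higher is better."""
    x_train = [xs[i] for i in train]
    y_train = [ys[i] for i in train]
    return -mean((knn_predict(x_train, y_train, xs[i], k) - ys[i]) ** 2 for i in test)


def inner_search(xs, ys, train, candidates, n_inner):
    """Pick k by cross-validation using only the outer training indices."""
    def cv_score(k):
        return mean(neg_mse(xs, ys, tr, te, k) for tr, te in kfold(train, n_inner))
    return max(candidates, key=cv_score)


def nested_cv(xs, ys, candidates, n_outer=5, n_inner=3):
    results = []
    for train, test in kfold(list(range(len(xs))), n_outer):
        best_k = inner_search(xs, ys, train, candidates, n_inner)  # inner loop
        results.append((best_k, neg_mse(xs, ys, train, test, best_k)))  # outer test
    return results


rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [x ** 0.5 + rng.gauss(0, 0.3) for x in xs]
results = nested_cv(xs, ys, candidates=[1, 3, 5, 9])
for best_k, score in results:
    print(best_k, round(score, 3))
```

Each outer test fold is never seen by the inner search that chose the hyperparameter for it, which is the property that keeps the outer scores honest.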

AFTER you do this, you'll get a bunch of cross-validation scores. My original question was: "what do you get/do now?" Nested cross-validation is not for model selection. What I mean by that, is that you're not trying to get parameter values that are good for your final model. That's what the inner RandomizedSearchCV is for.

But of course, if you are using something like a RandomForest for feature selection in your pipeline, then you'd expect a different set of parameters each time! So what do you really get that's useful?

Nested cross-validation is to give an unbiased estimate as to how good your methodology/series of steps is. What is "good"? Good is defined by the stability of hyperparameters and the cross-validation scores you ultimately get. Say you get numbers like I did: I got cross-validation scores of: [0.57027027, 0.48918919, 0.37297297, 0.74444444, 0.53703704]. So depending on the mood of my method of doing things, I can get an ROC score between 0.37 and 0.74 — obviously this is undesirable. If you were to look at my hyper-parameters, you'd see that the "optimal" hyper-parameters vary wildly. Whereas if I got consistent cross-validation scores that were high, and the optimal hyper-parameters were all in the same ballpark, I can be fairly confident that the way I am choosing to select features and model my data is pretty good.
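A quick way to summarize that spread is to look at the mean and standard deviation of the outer scores. The snippet below uses the five scores quoted above and only the Python standard library.

```python
from statistics import mean, stdev

# Outer cross-validation scores quoted in the text.
scores = [0.57027027, 0.48918919, 0.37297297, 0.74444444, 0.53703704]

summary = {
    "min": min(scores),
    "max": max(scores),
    "mean": mean(scores),
    "stdev": stdev(scores),  # sample standard deviation across folds
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A standard deviation that is large relative to the mean, as it is here, is a sign that the procedure is unstable across folds.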

If you have instability—I am not sure what you can do. I'm still new to this—the gurus on this board probably have better advice other than blindly changing your methodology.

But if you have stability, what's next? This is another important aspect that I neglected to understand: a really good and predictive and generalizable model created by your training data is NOT the final model. But it's close. The final model uses all of your data, because you're done testing and optimizing and tweaking (yeah, if you'd try and cross-validate a model with data you used to fit it, you'd get a biased result, but why would you cross-validate it at this point? You've already done that, and hopefully a bias issue doesn't exist)—you give it all the data you can so it can make the most informed decisions it can, and the next time you'll see how well your model does, is when it's in the wild, using data that neither you nor the model has ever seen before.

I hope this helps someone. For some reason it took me a really long time to wrap my head around this, and here are some other links I used to understand:

http://www.pnas.org/content/99/10/6562.full.pdf — A paper that re-examines data and conclusions drawn by other genetics papers that don't use nested cross-validation for feature selection/hyper-parameter selection. It's somewhat comforting to know that even super smart and accomplished people also get swindled by statistics from time to time.

http://jmlr.org/papers/volume11/cawley10a/cawley10a.pdf — iirc, I've seen an author of this paper answer a ton of questions about this topic on this forum

Training with the full dataset after cross-validation? — One of the aforementioned authors answering a similar question in a more colloquial manner.

http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html — the sklearn example
