Machine Learning – How to Choose a Parameter Grid for Model’s Hyperparameter Tuning

Tags: hyperparameter, machine-learning, tuning

How do people decide on the ranges for hyperparameters to tune?

For example, I am tuning an xgboost model and have been following a guide on Kaggle to set the ranges of each hyperparameter before running a Bayesian optimisation search. E.g. the guide lists the typical values for max_depth of xgboost as 3-10 – how is this range decided on as typical?

I chose to read about the hyperparameter and set:

xgb_parameters = {
    'max_depth': (1, 4),
    ...
}

I chose 1-4 because I read that a large depth is more computationally expensive and carries a greater risk of overfitting, but I have no specific reason for choosing 4 as the upper end of the range.
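For context, a dictionary of ranges like this is typically handed to a search wrapper. A minimal sketch, assuming scikit-optimize's BayesSearchCV and an XGBoost classifier (the extra learning_rate range and the placeholder data X, y are illustrative, not from the question):

    # Minimal sketch: Bayesian optimisation over an xgboost parameter space.
    # Assumes scikit-optimize (skopt) and xgboost are installed; X, y are
    # placeholder training data.
    from skopt import BayesSearchCV
    from xgboost import XGBClassifier

    xgb_parameters = {
        'max_depth': (1, 4),                          # the range in question
        'learning_rate': (0.01, 0.3, 'log-uniform'),  # illustrative extra range
    }

    search = BayesSearchCV(
        estimator=XGBClassifier(n_estimators=100),
        search_spaces=xgb_parameters,
        n_iter=25,  # number of hyperparameter settings to evaluate
        cv=5,       # 5-fold cross-validation per setting
    )
    # search.fit(X, y)  # X, y: your training features and labels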

Is there a resource or paper I should refer to in order to understand the ranges of hyperparameters for the models I'm interested in? Or are they generally set by trial and error depending on your prediction problem? Or am I worrying too much about needing exact reasoning for the ranges, and is it acceptable as long as I have a reason for my general range?

Best Answer

this guide lists the typical values for max_depth of xgboost as 3-10 - how is this range decided on as typical?

You guessed correctly. "Typical" means that in most problems this parameter ends up somewhere between 3 and 10. That is based on experimentation and trial and error, and it is just a rough guide: another guide could have said 3-12, and it wouldn't be wrong. The right range is problem- and data-dependent, and there are computational considerations as well: maybe the business problem is not that sensitive, and you wouldn't want to wait a whole day just to get a 0.5% improvement.

However, although there are no set-in-stone rules for choosing max_depth candidates (at least to the best of my knowledge), you can choose reasonable values/ranges based on your number of features and dataset size. For instance, if your dataset has around 1000 samples, and each split is assumed to split the space in half, then after 10 splits your leaves would be left with only one or two samples each (since 2^10 = 1024), which looks like overfitting (not necessarily, but it is a warning sign). You can come up with bounds following this kind of logic.
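That back-of-the-envelope logic can be written down directly. A minimal sketch of the bound (the min_samples_leaf threshold is an illustrative assumption, not a rule):

    import math

    def max_depth_upper_bound(n_samples, min_samples_leaf=20):
        # Assuming each split halves the data, a leaf at depth d holds
        # roughly n_samples / 2**d samples. Solving
        # n_samples / 2**d >= min_samples_leaf for d gives the bound below.
        return max(1, math.floor(math.log2(n_samples / min_samples_leaf)))

    print(max_depth_upper_bound(1000))     # 5 -> leaves keep >= 20 samples
    print(max_depth_upper_bound(1000, 1))  # 9 -> by depth ~10, ~1 sample per leaf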
