Solved – Common values used for hyperparameter grid search

algorithms, machine learning, optimization, scikit-learn

When performing a grid search to explore optimal hyperparameters, what typical values or ranges are commonly used for alpha, epsilon, gamma, lambda, C, etc.? I'm not focused on a particular algorithm at the moment, and I realize each problem will be different. I'm just wondering whether anybody has a resource they can share that provides typical values or ranges that would make good starting points for hyperparameter tuning.

Best Answer

This is going to be very dependent on the algorithm, but I can answer for glmnet.

Glmnet is fit by cyclical coordinate descent: we apply an update rule to each coordinate in turn and iterate until convergence. The update rule is equation (5) in the glmnet paper.
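For concreteness, here is a small numpy sketch of that update as I read it: a soft-thresholded least-squares update for the elastic-net penalty, assuming standardized predictors (columns of $X$ with mean zero and unit variance). It illustrates the idea rather than reproducing glmnet's actual implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coordinate_update(X, y, beta, j, lam, alpha):
    """One cyclical coordinate-descent update for coordinate j of an
    elastic-net linear model.  Assumes the columns of X are standardized;
    a sketch of the update rule, not glmnet's implementation."""
    n = X.shape[0]
    # Partial residual with the j-th coordinate's own contribution added back in.
    r_j = y - X @ beta + X[:, j] * beta[j]
    z = X[:, j] @ r_j / n
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
```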

As the authors note, this update rule implies that, if some parameter $\beta_j$ is zero at a given iteration, it stays zero after the update if and only if

$$ \left| \langle x_j, y \rangle \right| < N \lambda \alpha $$

where $N$ is the number of training data points. If we let $\tilde \lambda$ denote the smallest $\lambda$ for which all parameters are estimated as zero (the set of all such $\lambda$ is an unbounded interval of the form $[\tilde \lambda, \infty)$, so this minimum exists), the above relation tells us that

$$ N \alpha \tilde \lambda = \max_j \left| \langle x_j, y \rangle \right| $$

that is, all parameters are estimated as zero when

$$ \lambda > \frac{1}{N \alpha} \max_j \left| \langle x_j, y \rangle \right| $$
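That threshold is easy to compute directly from the data; here is a minimal numpy sketch (assuming standardized predictors and $\alpha > 0$, and intended as an illustration rather than glmnet's code):

```python
import numpy as np

def lambda_max(X, y, alpha):
    """Smallest lambda at which all coefficients are estimated as zero:
    (1 / (N * alpha)) * max_j |<x_j, y>|.  Assumes standardized columns
    of X and alpha > 0; a sketch, not glmnet's code."""
    n = X.shape[0]
    return np.max(np.abs(X.T @ y)) / (n * alpha)
```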

This gives us bounds for our grid search:

  • When $\lambda = 0$, we get the unregularized model.
  • When $\lambda > \tilde \lambda$, all parameters are zero.

To fill in the middle, glmnet breaks this interval into multiplicatively even subintervals, i.e., subintervals whose endpoints are equally spaced on the logarithmic scale. My assumption is that this choice is based on empirical evidence, but I'm not really sure.
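In code, such a "multiplicatively even" grid is just a geometric sequence running from $\tilde \lambda$ down to a small fraction of it. The sketch below uses illustrative values for the grid length and the ratio (glmnet's own defaults may differ):

```python
import numpy as np

def lambda_grid(X, y, alpha, n_lambdas=100, eps=1e-3):
    """Lambda grid whose points are equally spaced on the log scale,
    from lambda_max down to eps * lambda_max.  The defaults for n_lambdas
    and eps are illustrative; glmnet's own defaults may differ."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    return np.geomspace(lam_max, eps * lam_max, num=n_lambdas)
```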