When performing a grid search to explore optimal hyperparameters, what typical values or ranges are commonly used for alpha, epsilon, gamma, lambda, C, etc.? I'm not focused on a particular algorithm at the moment, and I realize each problem will be different, but I'm wondering whether anybody has a resource that provides typical values or ranges that would be good starting points for hyperparameter tuning.
Solved – Common values used for hyperparameter grid search
Tags: algorithms, machine-learning, optimization, scikit-learn
Related Solutions
The paragraph in question refers to that document's equation (13): $$ \ln P(\theta \mid D) = - \frac12 \sum_{x, x'} \theta(x) K^{-1}(x, x') \theta(x') - C \sum_i l(y_i \, \theta(x_i)) + \text{const} $$ where $K^{-1}(x, x')$ is the element of the inverse kernel matrix corresponding to points $x$ and $x'$, i.e. it actually depends on the full dataset.
You can divide through by $C$ without changing the location of the maximum, which implies that the solution depends only on $\mathbf K^{-1} / C$ (where $\mathbf K$ denotes the full training set kernel matrix), or equivalently on $C \mathbf K$.
You can also see this directly from the dual objective function:
$$ \max_\alpha \sum_i \alpha_i - \frac12 \sum_{i, j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \\ \text{s.t. } 0 \le \alpha_i \le C \;\;\forall i, \quad \sum_i \alpha_i y_i = 0 $$
Consider replacing each $\alpha_i$ by $c \alpha_i$, $k(x_i, x_j)$ with $k(x_i, x_j) / c$, and changing the first constraint to $0 \le \alpha_i \le c C$. Then the new objective function is $$ \sum_i c \alpha_i - \frac12 \sum_{i, j} (c \alpha_i) (c \alpha_j) y_i y_j \frac{k(x_i, x_j)}{c} = c \left( \sum_i \alpha_i - \frac12 \sum_{i, j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \right) ,$$ the unaltered constraint is still satisfied, and so the optimal $\alpha$ is then scaled by $c$. Since predictions are based on $\alpha^T \mathbf k$, where $\mathbf k$ is the vector of test point kernels, replacing it with $c \alpha^T \mathbf k/c$ doesn't change the predictions.
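As a quick numerical check of this argument (my own sketch, not part of the original derivation), we can fit an SVM on a precomputed linear kernel matrix $\mathbf K$ with parameter $C$, and again on $\mathbf K / c$ with parameter $c C$; if the scaling argument is right, the two decision functions should agree up to optimizer tolerance:
import numpy as np
from sklearn import svm, datasets

X, y = datasets.make_classification(
    n_samples=200, n_features=10, random_state=0)
K = X @ X.T  # precomputed linear kernel matrix

c = 10.0
clf1 = svm.SVC(C=1.0, kernel='precomputed').fit(K, y)
clf2 = svm.SVC(C=c * 1.0, kernel='precomputed').fit(K / c, y)

# the optimal alphas scale by c, so the decision values coincide
print(np.abs(clf1.decision_function(K)
             - clf2.decision_function(K / c)).max())
The printed difference should be on the order of the solver's tolerance.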
If we're using a Gaussian RBF kernel $K(x, x') = \exp\left(- \gamma \lVert x - x' \rVert^2 \right)$, however, it is not true that this means we can explore only a single $C \gamma = \text{const}$ curve: $K$'s dependence on $\gamma$ is nonlinear. Doubling $\gamma$ and halving $C$ will result in a different $C \mathbf K$ matrix.
Here's some empirical evidence (in Python) on some arbitrary synthetic data:
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

X, y = datasets.make_classification(
    n_samples=1000, n_features=10, random_state=12)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

Cs = [.01, .1, 1, 10, 100]

print("RBF:")
for C in Cs:
    # gamma scales inversely with C, so C * gamma is held constant
    acc = svm.SVC(C=C, gamma=1/C, kernel='rbf') \
        .fit(Xtrain, ytrain).score(Xtest, ytest)
    print("{:5}: {:.3}".format(C, acc))

print("\nScaled linear:")
for C in Cs:
    # scaled linear kernel K(x, x') = x^T x' / C; note that a gamma
    # argument would be ignored when a callable kernel is passed
    acc = svm.SVC(C=C, kernel=lambda x, y: x.dot(y.T) / C) \
        .fit(Xtrain, ytrain).score(Xtest, ytest)
    print("{:5}: {:.3}".format(C, acc))
which prints
RBF:
0.01: 0.84
0.1: 0.504
1: 0.832
10: 0.86
100: 0.868
Scaled linear:
0.01: 0.88
0.1: 0.88
1: 0.88
10: 0.88
100: 0.88
The first set of numbers is for an RBF kernel with varying $C$ and $\gamma = 1/C$; because we get different classification accuracies, the predictions are clearly different.
The second is for a scaled linear kernel $K(x, x') = \gamma x^T x'$, in which case (because the dependence on $\gamma$ is linear) your reasoning is correct and all the predictions are the same.
Random search has a 95% probability of finding a combination of parameters within the top 5% of the search space using only 60 iterations. Compared to other methods, it also doesn't get bogged down in local optima.
Check this great blog post at Dato by Alice Zheng, specifically the section Hyperparameter tuning algorithms.
I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of “Random search for hyperparameter optimization” by Bergstra and Bengio. [...] Random search wasn’t taken very seriously before. This is because it doesn’t search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then came along Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough.
In hindsight, there is a simple probabilistic explanation for the result: for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability. That may sound complicated, but it's not. Imagine the 5% interval around the true maximum. Now imagine that we sample points from this space and see if any of them land within that interval. Each random draw has a 5% chance of landing in that interval; if we draw $n$ points independently, then the probability that all of them miss the desired interval is $\left(1 - 0.05\right)^{n}$. So the probability that at least one of them succeeds in hitting the interval is one minus that quantity. We want at least a 0.95 probability of success. To figure out the number of draws we need, just solve for $n$ in the equation:
$$1 - \left(1 - 0.05\right)^{n} > 0.95$$
We get $n\geqslant60$. Ta-da!
The moral of the story is: if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability.
You can improve that chance with a higher number of trials.
All in all, if you have too many parameters to tune, grid search may become infeasible. That's when I try random search.
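As a sanity check on the derivation quoted above, here's a quick computation (my own sketch, not from the blog post) of the required number of draws, plus a small Monte Carlo confirmation:
import math
import numpy as np

# smallest n with 1 - (1 - 0.05)**n > 0.95
n = math.ceil(math.log(0.05) / math.log(0.95))
print(n)  # 59; the blog post rounds this up to 60

# Monte Carlo: chance that at least one of 60 uniform draws lands
# in the top 5% of the unit interval
rng = np.random.default_rng(0)
hits = (rng.random((100_000, 60)) > 0.95).any(axis=1)
print(hits.mean())  # close to 1 - 0.95**60, about 0.954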
Best Answer
This is going to be very dependent on the algorithm, but I can answer for glmnet. Glmnet is fit through cyclical coordinate descent: we apply an update rule to each coordinate in turn and iterate until convergence. The update rule is equation (5) in this paper.
As the authors note, this update rule implies that, if some parameter $\beta_j$ is zero at one iteration, then it stays zero after that iteration if and only if
$$ \left| \langle x_j, y \rangle \right| < N \lambda \alpha $$
where $N$ is the number of training data points. If we let $\tilde \lambda$ denote the minimum lambda for which all parameters are estimated as zero (the set of all such lambdas is clearly an unbounded interval including infinity), the above relation tells us that
$$ N \alpha \tilde \lambda = \max_j \left| \langle x_j, y \rangle \right| $$
that is, all parameters are estimated as zero when
$$ \lambda > \frac{1}{N \alpha} \max_j \left| \langle x_j, y \rangle \right| $$
This gives us an upper bound for our grid search.
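For illustration, here's a minimal sketch (my own, assuming the features and response are standardized as the paper requires, so the inner products are plain dot products) of computing this upper endpoint:
import numpy as np

def lambda_max(X, y, alpha=1.0):
    # smallest lambda at which every coefficient is estimated
    # as zero, per the bound above
    N = X.shape[0]
    return np.abs(X.T @ y).max() / (N * alpha)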
To fill in the middle, glmnet breaks this interval up into multiplicatively even subintervals, i.e., subintervals whose endpoints are equally spaced on the logarithmic scale. My assumption is that this choice was based on empirical evidence, but I'm not really sure.
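Here's one way (a sketch of the same idea, not glmnet's actual source) to build such a multiplicatively even grid, running from the upper bound above down to a small fraction of it (glmnet exposes this fraction as lambda.min.ratio, if I recall its interface correctly):
import numpy as np

lam_max = 2.5    # placeholder: use lambda_max(X, y, alpha) from above
ratio = 1e-4     # hypothetical lower endpoint as a fraction of lam_max
n_lambdas = 100

# endpoints equally spaced on the log scale, i.e. multiplicatively even
grid = np.logspace(np.log10(lam_max), np.log10(lam_max * ratio), n_lambdas)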