I'm using random search for hyper-parameter optimization of a machine learning pipeline. For example, for the C and gamma parameter it is recommended to use logarithmically spaced values. Why should I use such values? For example, if I use logarithmic spaced values from $2^{-5}$ to $2^{15}$, then there will be many more values near to $2^{-5}$ (i.e. near zero) than near to $2^{15}$.
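To make the question concrete, here is a minimal sketch (assuming a Python/NumPy setup, which the question does not specify) of what a log-spaced candidate grid for C looks like: neighbouring values are a constant *factor* apart rather than a constant *distance* apart, so the grid covers tiny and huge magnitudes equally.

```python
import numpy as np

# Log-spaced candidates for C from 2^-5 to 2^15, stepping the exponent by 2
# (the classic range from the LIBSVM practical guide).
C_values = np.logspace(-5, 15, num=11, base=2.0)

# Each neighbouring pair differs by the same multiplicative factor (2^2 = 4),
# even though the absolute gaps near 2^15 are enormous compared to those near 2^-5.
ratios = C_values[1:] / C_values[:-1]
print(C_values)
print(ratios)
```

This is exactly why, on a linear axis, the points look bunched up near $2^{-5}$: equal ratios translate to very unequal absolute distances.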
Solved – Why logarithmic scale for hyper-parameter optimization
hyperparameter, machine learning, optimization
Related Solutions
For some of these parameters you can simply reason about what the range should be. For example, since the sparsity parameter is the desired average activation of the hidden units, it only makes sense if it is in $(0, 1)$, assuming you're using a sigmoid activation. However, you can narrow this down further, as a sparsity $> 0.5$ is somewhat nonsensical. For other parameters you can look for the values people commonly report in the literature, use those as a rough guess, and expand your ranges if you find it necessary. For example, I would probably start with $(0, 0.2)$ for the learning rate.
Another useful place to look would be Geoff Hinton's practical guide to training restricted Boltzmann machines. I imagine most of this advice would be applicable for autoencoders as well.
Random search has a 95% probability of finding a parameter combination within the top 5% of the search space using only 60 iterations. Also, compared to some other methods, it does not get bogged down in local optima.
Check this great blog post at Dato by Alice Zheng, specifically the section Hyperparameter tuning algorithms.
I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of “Random search for hyperparameter optimization” by Bergstra and Bengio. [...] Random search wasn’t taken very seriously before. This is because it doesn’t search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then came along Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough.
In hindsight, there is a simple probabilistic explanation for the result: for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability. That may sound complicated, but it’s not. Imagine the 5% interval around the true maximum. Now imagine that we sample points from this space and check whether any of them land within that interval. Each random draw has a 5% chance of landing in it, so if we draw n points independently, the probability that all of them miss the desired interval is $\left(1-0.05\right)^{n}$. The probability that at least one of them hits the interval is therefore 1 minus that quantity. We want at least a .95 probability of success. To figure out the number of draws we need, just solve for n in the equation:
$$1-\left(1-0.05\right)^{n}>0.95$$
We get $n\geqslant60$. Ta-da!
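The quoted argument is easy to check numerically. The short sketch below (plain Python, no libraries assumed) plugs $n = 60$ into the success-probability formula and confirms that it clears the 0.95 threshold.

```python
# Probability that a single random draw lands in the top-5% region.
p_hit = 0.05
n = 60

# P(at least one of n independent draws hits) = 1 - P(all miss)
p_success = 1 - (1 - p_hit) ** n
print(p_success)  # just above 0.95, confirming the blog's claim
```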
The moral of the story is: if the close-to-optimal region of hyperparameters occupies at least 5% of the grid surface, then random search with 60 trials will find that region with high probability.
You can improve that chance with a higher number of trials.
All in all, if you have too many parameters to tune, grid search may become infeasible. That's when I try random search.
Best Answer
... because a logarithmic scale enables us to search a much bigger space quickly. In your SVM example, we do not know the right order of magnitude for the hyper-parameter in advance, so a quick approach is to try dramatically different values, say, 1, 10, 100, 1000, which come from a logarithmic scale.
In addition, I think of a log-scale search as the first step. Suppose we find that C=10 is better than C=1 or C=100; we can then focus on that scale and search for a better value.
Another reason is that "regularization" parameters, such as C in SVM, are not very sensitive to small changes. In other words, we may not see much difference between 10, 15, or 20, but the results can be very different between 10 and 1000. That is why we start with a log search.
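The coarse-then-fine strategy described above can be sketched as a two-stage random search. This is a minimal illustration, not the answerer's actual procedure: `score` is a hypothetical stand-in for a cross-validation score, chosen as a toy objective that peaks near C = 10, and the ranges are the ones from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(C):
    # Toy stand-in for cross-validated accuracy (illustration only):
    # a concave function of log2(C) peaking near C = 10.
    return -(np.log2(C) - np.log2(10)) ** 2

# Stage 1: coarse log-uniform search over C in [2^-5, 2^15].
# Sampling the *exponent* uniformly spreads trials evenly across
# orders of magnitude, instead of piling them up near 2^15.
coarse = 2.0 ** rng.uniform(-5, 15, size=20)
best_coarse = max(coarse, key=score)

# Stage 2: refine within one factor of 2 either side of the best coarse value.
fine = 2.0 ** rng.uniform(np.log2(best_coarse) - 1,
                          np.log2(best_coarse) + 1, size=20)
best = max([best_coarse, *fine], key=score)
```

With a real pipeline you would replace `score` by a cross-validation routine; scikit-learn's `RandomizedSearchCV` with a log-uniform distribution implements the same idea in one step.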