For some of these parameters you can simply reason about what the range should be. For example, since the sparsity parameter is the desired average activation of the hidden units, it only makes sense if it lies in $(0, 1)$, assuming you're using a sigmoid activation. You can narrow this down further, though, since a sparsity $> 0.5$ is somewhat nonsensical. For other parameters you can look for the values people commonly report in the literature, use those as a rough guess, and expand your ranges if you find it necessary. For example, I would probably start with $(0, 0.2)$ for the learning rate.
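To make that concrete, here is a rough sketch of how such starting ranges might be written down; the exact endpoints are just the illustrative guesses from above, not prescriptions.

```python
# Hand-picked starting ranges for a grid/random search; the endpoints below are
# illustrative guesses based on the reasoning above, not recommended values.
param_ranges = {
    "sparsity_target": (0.01, 0.5),  # average hidden activation; > 0.5 is rarely sensible
    "learning_rate":   (0.0, 0.2),   # a common starting interval in the literature
}
```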
Another useful place to look would be Geoff Hinton's practical guide to training restricted Boltzmann machines. I imagine most of this advice would be applicable for autoencoders as well.
This is going to be very dependent on the algorithm, but I can answer for glmnet.
Glmnet is fit through cyclical coordinate descent: we apply an update rule to each coordinate in turn and iterate until convergence. The update rule is equation 5 in this paper.
As the authors note, this update rule implies that, if some parameter $\beta_j$ is zero at some iteration, then it stays zero after the iteration if and only if
$$ \left| \langle x_j, y \rangle \right| < N \lambda \alpha $$
where $N$ is the number of training data points. If we let $\tilde \lambda$ denote the minimum lambda for which all parameters are estimated as zero (the set of all such lambdas is clearly an unbounded interval including infinity), the above relation tells us that
$$ N \alpha \tilde \lambda = \max_j \left| \langle x_j, y \rangle \right| $$
that is, all parameters are estimated as zero when
$$ \lambda > \frac{1}{N \alpha} \max_j \left| \langle x_j, y \rangle \right| $$
This gives us bounds on our grid search:
- When $ \lambda = 0$ we get the un-regularized model.
- When $ \lambda > \tilde \lambda $ all parameters are zero.
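As a rough illustration, here is a small numpy sketch of that bound. It assumes $X$ is an $(N, p)$ design matrix with standardized columns (as glmnet standardizes by default) and $y$ is the centered response; this is not glmnet's actual code.

```python
# Sketch: smallest lambda at which every coefficient is estimated as exactly zero,
# lambda_tilde = max_j |<x_j, y>| / (N * alpha). Assumes standardized columns of X.
import numpy as np

def lambda_tilde(X, y, alpha=1.0):
    N = X.shape[0]
    return np.max(np.abs(X.T @ y)) / (N * alpha)

# Illustrative check on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
print(lambda_tilde(X, y))
```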
To fill in the middle, glmnet breaks up this interval into multiplicatively even subintervals, i.e., subintervals whose endpoints are equally spaced on the logarithmic scale. My assumption is that this was done based on some empirical evidence, but I'm not really sure.
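Such a log-spaced grid is easy to construct; the sketch below reuses the `lambda_tilde` function from above, and the lower endpoint (a fixed fraction of $\tilde\lambda$) and the 100 grid points are my own illustrative choices, not values quoted from glmnet.

```python
# Sketch of a multiplicatively even grid: points equally spaced on the log scale,
# running from lambda_tilde down to an (assumed) small fraction of it.
import numpy as np

lam_max = lambda_tilde(X, y)                     # from the sketch above
lam_min = 1e-3 * lam_max                         # assumed lower endpoint
grid = np.geomspace(lam_max, lam_min, num=100)   # decreasing, log-spaced
```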
Best Answer
One nice option, which is at the same time very easy to implement, is Random Search for Hyper-Parameter Optimization. It is also implemented in scikit-learn (see section 3.2.2).
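For illustration, here is a minimal sketch using scikit-learn's `RandomizedSearchCV`; the estimator, the log-uniform range for `C`, and the 60 iterations are illustrative choices of mine, not part of the original method.

```python
# Minimal random-search sketch with scikit-learn. Parameter values are sampled
# from distributions rather than evaluated on a fixed grid.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_distributions = {"C": loguniform(1e-4, 1e2)}  # assumed search range

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=60,        # ~60 random draws, see the probability argument below
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```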
The idea is to have an algorithm that yields an optimal result with high probability while being computationally very efficient (compared to brute-force or more naive approaches like grid search). How is it able to prune the search space efficiently?
The basic idea is that not all parameters are equally important to tune in order to get good results (i.e., the optimal solution lies in a low-dimensional submanifold of the search space), so we should find a way to focus on that region of interest.
What they do is model the probability distribution over the hyperparameter space, which reflects the idea that configurations yielding good results are concentrated around the true optimum. Now, for any probability distribution with a finite maximum the following holds: take the region around the maximum which accounts for 5% of the total probability. If you take $n$ samples at random, each one of them has a 0.05 chance of falling within that region, so with probability $(1-0.05)^n$ they all miss it.
Now they ask: how many samples $n$ do I need to take so that, with high probability (say 95%), at least one of them falls in that region? $$ 1-(1-0.05)^n \ge 0.95 $$ which yields $n \ge 60$. This approach has been used with success in deep learning, where the number of hyperparameters is huge.
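For completeness, a quick check of that bound; the exact threshold works out to 59 draws, in line with the rule of thumb of roughly 60 quoted above.

```python
# Smallest n with 1 - (1 - 0.05)**n >= 0.95
import math

n = math.ceil(math.log(0.05) / math.log(0.95))
print(n)  # 59 -- consistent with the usual rule of thumb of about 60 random draws
```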