Solved – The “Amazing Hidden Power” of Random Search

gradient-descent, neural-networks, optimization, sampling, stochastic-processes

I have the following question that compares random search optimization with gradient descent optimization:

Based on the (amazing) answer provided over here, Optimization when Cost Function Slow to Evaluate, I realized something really interesting about Random Search:

Random Search

Even when the cost function is expensive to evaluate, random search
can still be useful. Random search is dirt-simple to implement. The
only choice for a researcher to make is setting the probability
$p$ that you want your results to lie in some quantile $q$; the rest
proceeds automatically using results from basic probability.

Suppose your quantile is $q = 0.95$ and you want a $p=0.95$
probability that the model results are in the top $100\times (1-q)=5$
percent of all hyperparameter tuples. The probability that all $n$
attempted tuples are not in that window is $q^n = 0.95^n$ (because
they are chosen independently at random from the same distribution),
so the probability that at least one tuple is in that region is $1 - 0.95^n$. Putting it all together, we have

$$ 1 - q^n \ge p \implies n \ge \frac{\log(1 - p)}{\log(q)} $$

which in our specific case yields $n \ge 59$.

This result is why most people recommend $n=60$ attempted tuples for
random search. It's worth noting that $n=60$ is comparable to the
number of experiments required to get good results with Gaussian
process-based methods when there are a moderate number of parameters.
Unlike Gaussian processes, the number of queried tuples does not
change with the number of hyperparameters to search over; indeed, for
a large number of hyperparameters, a Gaussian process-based method can
take many iterations to make headway.

Since you have a probabilistic characterization of how good the
results are, this result can be a persuasive tool to convince your
boss that running additional experiments will yield diminishing
marginal returns.

Using random search, you can mathematically show that, regardless of how many dimensions your function has, 60 iterations are enough to give a 95% probability of obtaining an answer in the top 5% of all possible solutions!

  • Suppose there are 100 possible solutions to your optimization problem (this does not depend on the number of dimensions). An example of a solution is $(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$.

  • The top 5% of solutions will include the top 5 solutions (i.e. the 5 solutions that provide the 5 lowest values of the function you want to optimize)

  • The probability of encountering at least one of the top 5 solutions in "$n$ iterations": $\boldsymbol{1 - [(1 - 5/100)^n]}$

  • If you want this probability $= 0.95$, you can solve for $n$: $\boldsymbol{1 - [(1 - 5/100)^n] = 0.95}$

  • Thus, $\boldsymbol{n = 60}$ iterations!

But the fascinating thing is, $\boldsymbol{n = 60}$ iterations is still valid regardless of how many solutions exist. For instance, even if 1,000,000,000 solutions exist, you still only need 60 iterations to have a 0.95 probability of encountering a solution in the top 5% of all solutions (the short sketch after the bullets below checks this numerically)!

  • $\boldsymbol{1 - [(1 - (0.05 \times 1000000000)/1000000000)^{n}] = 0.95}$

  • "$\boldsymbol{n}$" will still be 60 iterations!

My Question: Based on this amazing "hidden power" of random search, and further taking into consideration that random search is much faster than gradient descent, since it does not require you to calculate the derivatives of multidimensional, complex loss functions (e.g., neural networks): why do we use gradient descent instead of random search for optimizing the loss functions in neural networks?

The only reason that I can think of is that if the ranked distribution of the optimization values is "heavily negatively skewed", then the top 1% might be significantly better than the top 2%–5%, and the number of iterations required to encounter a solution in the top 1% will also be significantly larger:

[figure: a heavily negatively skewed distribution of optimization scores]

But even with such a distribution of optimization scores, would gradient descent still have its advantages? Does gradient descent (or stochastic gradient descent) really have the ability to compete with this "hidden power" of random search? If certain conditions are met, does gradient descent, thanks to its attractive theoretical properties (e.g., convergence), have the ability to reach the best solution (not the best 5%, but the best solution) much faster than random search? Or, in real-life applications with non-convex and noisy objective functions, do these "attractive theoretical properties" of gradient descent generally not apply, so that once again the "amazing hidden power" of random search wins?

In short: based on this amazing "hidden power" (and speed) of random search, why do we use gradient descent instead of random search for optimizing the loss functions in neural networks?

Can someone please comment on this?

Note: Based on the sheer volume of literature praising the effectiveness of stochastic gradient descent, I am assuming that stochastic gradient descent does have many advantages compared to random search.

Note: Related Question that resulted from an answer provided to this question: No Free Lunch Theorem and Random Search

Best Answer

  1. One limitation of random search is that searching over a large space is extremely challenging; even a small deviation from a good configuration can spoil the result.

    Émile Borel's 1913 article "Mécanique Statistique et Irréversibilité" stated that if a million monkeys spent ten hours a day at typewriters, it is extremely unlikely that the quality of their writing would equal the contents of a library. And of course we understand the intuition: language is highly structured (not random), so randomly pressing keys is not going to yield a coherent text. Even a text that is extremely similar to real language can be rendered incoherent by a minor error.


    In terms of estimating a model, you need to estimate everything correctly simultaneously. Getting the correct slope $\hat{\beta}_1$ in the model $y = \beta_0 + \beta_1 x +\epsilon$ is not very meaningful if $\hat{\beta}_0$ is very far from the truth. In higher dimensions, such as in a neural network, good values must be found for thousands or millions of parameters simultaneously. This is unlikely to happen at random!

    This is directly related to the curse of dimensionality. Suppose your goal is to find a solution with a distance less than 0.05 from the true solution, which is at the middle of the unit interval. Using random sampling, the probability of this is 0.1. But as we increase the dimension of the search space to a unit square, a unit cube, and higher dimensions, the volume occupied by our "good solution" (a solution with distance from the optimal solution less than 0.05) shrinks, and the probability of finding that solution using random sampling likewise shrinks. (And naturally, increasing the size of the search space but keeping the dimensionality constant also rapidly diminishes the probability.)

    The "trick" to random search is that it purports to defeat this process by keeping the probability constant while dimension grows; this must imply that the volume assigned to the "good solution" increases, correspondingly, to keep the probability assigned to the event constant. This is hardly perfect, because the quality of the solutions within our radius is worse (because these solutions have a larger average distance from the true value).

  2. You have no way to know if your search space contains a good result. The core assumption of random search is that your search space contains a configuration that is "good enough" to solve your problem. If a "good enough" solution isn't in your search space at all (perhaps because you chose too small a region), then the probability of finding it is 0. Random search can only find, with positive probability, solutions in the top 5% of whatever the search space contains.

    You might think that enlarging the search space is a good way to increase your odds. While it might make the search region contain an extremely high-quality area, the probability of selecting something in that area shrinks rapidly as the search space grows.


  3. High-quality model parameters often reside in narrow valleys. When considering hyperparameters, it's often true that the hyperparameter response surface changes only gradually; there are large regions of the space where lots of hyperparameter values are basically the same in terms of quality. Moreover, a small number of hyperparameters make large contributions to improving the model; see Examples of the performance of a machine learning model to be more sensitive to a small subset of hyperparameters than others?

But in terms of estimating the model parameters, we see the opposite phenomenon. For instance, regression problems have likelihoods that are prominently peaked around their optimal values (provided you have more observations than features); moreover, these peaks become more and more "pointy" as the number of observations increases. Peaked optimal values are bad news for random search, because this means that the "optimal region" is actually quite small, and all of the "near miss" values are much poorer in comparison to the optimal value.
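A minimal sketch of that "pointiness", assuming Python with NumPy: fit a simple linear regression and compare the sum of squared errors at the least-squares optimum with a "near miss" whose slope is off by only 0.05. The gap grows roughly linearly with the number of observations, so near misses look progressively worse as data accumulate.

```python
import numpy as np

# For y = 2 + 3x + noise, compare the sum of squared errors at the OLS fit with
# a "near miss" whose slope is off by 0.05. The gap grows with n, i.e. the peak
# around the optimum becomes sharper and sharper in absolute terms.
rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    x = rng.normal(size=n)
    y = 2.0 + 3.0 * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS estimate
    near_miss = beta_hat + np.array([0.0, 0.05])       # slope off by 0.05
    gap = np.sum((y - X @ near_miss) ** 2) - np.sum((y - X @ beta_hat) ** 2)
    print(f"n = {n:>9,d}   SSE(near miss) - SSE(optimum) = {gap:,.1f}")
```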

To make a fair comparison between random search and gradient descent, set a budget of iterations (e.g. the $n=60$ value derived from random search). Then compare the quality of a neural network fit with $n$ iterations of ordinary gradient descent and backprop to a model that uses $n$ iterations of random search. As long as gradient descent doesn't diverge, I'm confident that it will beat random search with high probability.
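A minimal sketch of such a budgeted comparison, assuming Python with NumPy and substituting a 50-dimensional least-squares problem for the neural network so the example stays short: both methods get exactly 60 evaluations.

```python
import numpy as np

# Budgeted comparison on a toy problem f(w) = ||Aw - b||^2 in 50 dimensions:
# 60 gradient descent steps vs. the best of 60 uniformly random draws.
rng = np.random.default_rng(0)
d, budget = 50, 60
A = rng.normal(size=(200, d))
w_true = rng.uniform(-0.5, 0.5, size=d)   # optimum lies inside the search box
b = A @ w_true                            # so the minimum of f is exactly 0


def loss(w):
    return np.sum((A @ w - b) ** 2)


def grad(w):
    return 2.0 * A.T @ (A @ w - b)


# Gradient descent with step size 1/L, where L = 2 * ||A||_2^2 is the Lipschitz
# constant of the gradient, so the iterates cannot diverge on this problem.
L = 2.0 * np.linalg.norm(A, 2) ** 2
w = np.zeros(d)
for _ in range(budget):
    w -= grad(w) / L

# Random search: best of `budget` draws from the assumed search box [-1, 1]^d.
candidates = rng.uniform(-1.0, 1.0, size=(budget, d))
best_random = min(loss(c) for c in candidates)

print(f"loss after {budget} gradient descent steps: {loss(w):.4f}")
print(f"best loss from {budget} random draws:       {best_random:.4f}")
```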


  4. Obtaining even stronger guarantees rapidly becomes expensive. You can of course adjust $p$ or $q$ to increase the assurance that you'll find a very high quality solution, but if you work out the arithmetic, you'll find that $n$ rapidly becomes very large (that is, random search becomes expensive quickly); see the sketch below. Moreover, in a fair comparison, gradient descent will likewise take $n$ optimization steps, and tend to find even better solutions than random search.
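Working out that arithmetic with the same formula as before (a quick sketch, holding $p = 0.95$ and tightening the target quantile):

```python
import math

# n = ceil(log(1 - p) / log(q)) for p = 0.95; for q near 1 this grows roughly
# like 3 / (1 - q), so each extra "9" in the quantile multiplies the cost by ~10.
for q in (0.95, 0.99, 0.999, 0.9999):
    n = math.ceil(math.log(1 - 0.95) / math.log(q))
    print(f"top {100 * (1 - q):.2f}% of configurations -> n = {n:,d}")
```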

Some more discussion, with an intuitive illustration: Does random search depend on the number of dimensions searched?
