Why is it difficult to find the Global Optimum?


When studying Calculus, I learnt that it is possible to take the derivative of a function to find its minimum and maximum points.

I then wondered what happens if there is more than one minimum or maximum in a function, and remembered that optimization algorithms exist for this purpose, for example Hill Climbing, Genetic Algorithms and Simulated Annealing.

These all try to find the Global Optimum, but because they rely on some randomness they may not return the best answer, as the sketch below illustrates. A quick search on Wikipedia shows that there are many optimization algorithms.
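As a toy illustration of that randomness (my own sketch, not a reference implementation of any of the algorithms named above): hill climbing with a random starting point on a one-dimensional function with several local maxima can return a different "optimum" for each seed.

```python
# Minimal hill-climbing sketch (illustrative only): random start, accept only
# uphill moves. Different seeds can get stuck at different local maxima.
import math
import random

def f(x):
    return math.sin(3 * x) + 0.5 * math.sin(x)   # several local maxima on [0, 10]

def hill_climb(seed, steps=1_000, step_size=0.1, lo=0.0, hi=10.0):
    rng = random.Random(seed)
    x = rng.uniform(lo, hi)                       # random starting point
    for _ in range(steps):
        candidate = min(hi, max(lo, x + rng.uniform(-step_size, step_size)))
        if f(candidate) > f(x):                   # accept only uphill moves
            x = candidate
    return x, f(x)

for seed in range(3):
    x, fx = hill_climb(seed)
    print(f"seed {seed}: x = {x:.3f}, f(x) = {fx:.3f}")
```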

Is there any intuitive explanation as to why Optimization is difficult and requires a whole field of study?

What are the main challenges of finding the Global Extrema?

Why is it not always possible to determine the Global Minimum and Maximum from the derivative?

Best Answer

There is a powerful result called the Extreme Value Theorem: any continuous function $f:X \rightarrow \mathbb{R}$ on a compact set $X$ attains a maximum and a minimum on that set, and $X$ contains the maximizer and minimizer. At any local maximizer $x^*$, $f$ is either non-differentiable or satisfies $(x'-x^*)'\nabla f(x^*) \le 0$ for all $x' \in X$ (and similarly for minimizers). From an analytical perspective, the problem is fairly straightforward.
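To see why the derivative alone is not enough (a small example of my own, not from the original answer): take $X = [0,1]$ and $f(x) = x$. The maximizer is $x^* = 1$, where $f'(1) = 1 \neq 0$, yet the condition above holds because $(x' - 1)\cdot 1 \le 0$ for every $x' \in [0,1]$. On a compact set the extremum can sit on the boundary where the gradient does not vanish, so solving $\nabla f = 0$ alone can miss it.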

The problem is really computational. Computing the set on which $f$ is non-differentiable, and the set of points $x^*$ for which $(x'-x^*)'\nabla f(x^*) \le 0$ for all $x' \in X$, can be very difficult, especially in high-dimensional spaces. Computers generally do not do symbolic computation, so numerical estimates of gradients and Hessians can be unreliable. If you can prove that your updating rule for the guess of the extremum is a Banach-type contraction, the iteration converges from any starting point to a unique fixed point; but if not, your updating rule might actually move away from the local minimum or maximum you are searching for. For example, Newton's method has excellent convergence properties, but only if you start inside a "basin of attraction" of the local extremum, as the sketch below illustrates.
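Here is a hedged sketch of that sensitivity to the starting point (my own toy example, not from the original answer): Newton's method applied to $f'(x) = 0$ for a quartic with three critical points lands on a different critical point depending on where it starts, and from one of the starts it converges to a local maximum rather than a minimum.

```python
# Newton iteration on f'(x) = 0 for f(x) = x**4 - 3*x**2 + x, which has three
# critical points (two minima and one maximum). The start decides the outcome.

def f_prime(x):
    return 4 * x**3 - 6 * x + 1      # f'(x)

def f_double_prime(x):
    return 12 * x**2 - 6             # f''(x)

def newton(x0, tol=1e-10, max_iter=100):
    """Newton's method on f'(x) = 0; finds *some* critical point near x0."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Different starting points land in different basins of attraction;
# starting at 0.0 converges to the local maximum, not a minimum.
for x0 in (-2.0, 0.0, 2.0):
    print(f"start {x0:+.1f} -> critical point {newton(x0):+.6f}")
```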

Another way to think about it is Sard's Theorem: the set of critical values of a sufficiently differentiable map has measure zero. That sounds nice, because it means the set of things you are looking for is small, provided your function has enough curvature. But in a high-dimensional space this is like looking for needles in a haystack: the chance that a randomly drawn point is one of them is essentially zero. So unless you know the function you are optimizing is roughly concave or convex, there is very little clue about where to start; if you pick initial guesses at random, it is almost impossible to start at a solution, and you are at the mercy of the convergence properties of your algorithm.
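A rough numerical illustration of the needles-in-a-haystack point (my own sketch, using the standard Rastrigin test function and assuming its global basin is roughly the unit box around the origin): the fraction of uniform random starting points that even land in the global minimizer's basin shrinks geometrically with dimension.

```python
# Fraction of uniform random starts in [-5.12, 5.12]^dim that land within 0.5
# of the Rastrigin global minimizer (the origin) in every coordinate.
import numpy as np

rng = np.random.default_rng(0)

def fraction_in_global_basin(dim, n_samples=100_000, box=5.12, basin=0.5):
    """Fraction of uniform starts whose every coordinate is within `basin` of 0."""
    x = rng.uniform(-box, box, size=(n_samples, dim))
    return np.mean(np.all(np.abs(x) < basin, axis=1))

for dim in (1, 2, 5, 10):
    print(f"dim={dim:2d}: {fraction_in_global_basin(dim):.6f}")
```

By dimension 10 essentially none of the 100,000 random starts fall in the basin, so a local method started at random almost never begins anywhere near the global solution.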

It's really pretty miserable. I have seen many people at the beginning of their careers write down very complex models, assuming that because you can write a model down you must be able to solve it on a computer, and come back six months later, after learning all kinds of parallel computing, simulated annealing and neural network tools, with the realization that computers are not magical and their usefulness is much more limited than people realize.
