How can a computer algorithm optimize a discontinuous function

algorithms, optimization, soft-question

Let the function $f : [0,1] \to \Bbb R$ be defined by

$$f (x) := \begin{cases} \left| x - \frac{1}{\pi} \right| & \text{ if } x \neq \frac{1}{\pi}\\ 10 & \text{ if } x = \frac{1}{\pi}\end{cases}$$

Can problems like $$\max_{x \in [0,1]} f(x)$$ be solved by a computer algorithm? Is it possible to prove that no computer algorithm is capable of optimizing functions of this type?

Background: we are trying to optimize $\phi: \Delta \to \mathbb N$, where $\Delta$ stands for the collection of all probability measures. Of course, any algorithm can only handle finitely many values, so we have to discretize $\Delta$ into steps. However, we cannot exclude the possibility that the optima lie between the steps.

I've looked into some literature on applying mollification or smooth approximation, but my $f(x)$ might just become a plain absolute-value function after "mollification", because the set $\{1/\pi\}$ has measure zero and carries no weight.
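To make this concrete: for any mollifier $\rho_\varepsilon$, changing $f$ on the null set $\{1/\pi\}$ does not change the convolution at all,

$$(f * \rho_\varepsilon)(x) = \int_0^1 f(y)\,\rho_\varepsilon(x - y)\,dy = \int_0^1 \left|y - \tfrac{1}{\pi}\right|\,\rho_\varepsilon(x - y)\,dy,$$

so the mollified $f$ coincides with the mollified absolute-value function and the spike at $1/\pi$ disappears.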

Best Answer

It depends on what you mean by "optimize" or "solve". Essentially, there are (at least) two distinct notions of a computer accessing a function --- white-box access and black-box access.

Black-box access is somewhat more common (within theoretical computer science, at least). You view the computer as having some "oracle" $\mathcal{O}(\cdot)$ which it can query. It has no idea of the oracle's inner workings, but it can send the oracle inputs and receive outputs back. You can view this as having an API for some function while having absolutely no clue how it is implemented.
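As a minimal sketch of the black-box model (my own illustration, not part of the original question): the optimizer sees $f$ only as an opaque callable, and a uniform grid search on the question's $f$ misses the spike at $1/\pi$ entirely.

```python
from math import pi

# Black-box access: the optimizer receives f only as an opaque callable.
def f(x):
    return 10.0 if x == 1 / pi else abs(x - 1 / pi)

def grid_search_max(oracle, n):
    """Query the oracle on a uniform grid of n+1 points in [0, 1]."""
    grid = [i / n for i in range(n + 1)]
    best = max(grid, key=oracle)
    return best, oracle(best)

# No matter how fine the grid, it never contains the irrational point 1/pi,
# so the search reports max ~ 1 - 1/pi ~ 0.682 at x = 1, never the true 10.
print(grid_search_max(f, 10**6))
```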

White box access means that you give the computer "the code of the function" in some way. Maybe (theoretically) you give it $M$, the description of a Turing Machine (or some other model of computation) which computes the function. Maybe (practically) you give the computer access to the file in which the function is defined.

We'll discretize to make things simpler (from a computational perspective). Let $f : D \to R$ be the function you want to optimize, with $D$ finite. Then, for an arbitrary discontinuous function $f$, black-box optimization takes at least $|D|$ queries to the oracle, i.e. you must check every point.

The proof is as follows. Say you have some optimization algorithm which queries the oracle for the function $f$ at the points $D' \subsetneq D$. Let $x \in D \setminus D'$ be a point which is not queried. Then the algorithm is incorrect on the function $f'$ which agrees with $f$ on $D'$ but has $f'(x) = 1 + \max_{y \in D} f(y)$: the algorithm sees identical query answers for $f$ and $f'$, so it returns the same output for both, yet their maxima differ. Note that this (simplistic) proof stops working if your optimization algorithm is randomized, but I believe your general intuition still holds: one cannot optimize arbitrary discontinuous functions.
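Here is a small illustration of that adversary argument (a sketch of mine, using an arbitrary toy $f$ on $D = \{0,\dots,9\}$): given the points an algorithm actually queried, we construct an $f'$ it cannot distinguish from $f$.

```python
def adversary(f, domain, queried):
    """Build f' agreeing with f on every queried point, but whose maximum
    sits at a skipped point and strictly exceeds max f."""
    skipped = next(x for x in domain if x not in queried)
    bigger = 1 + max(f(x) for x in domain)
    return lambda x: bigger if x == skipped else f(x)

domain = range(10)
f = lambda x: x % 3              # arbitrary toy function on D
queried = set(domain) - {3}      # an algorithm that skipped x = 3
f_prime = adversary(f, domain, queried)

# The algorithm's view of f and f' is identical on the queried points...
assert all(f(x) == f_prime(x) for x in queried)
# ...yet their maxima differ, so its answer is wrong for at least one.
print(max(map(f, domain)), max(map(f_prime, domain)))   # 2 vs 3
```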

In the white-box model there are things you can do, but it is highly dependent on how the function is presented. Your function (for example) can be represented via a tree, where at each internal node you compute some predicate of the input and then branch left or right accordingly (this is formalized via the notion of a decision tree). While I do not know the state of the art for optimizing functions represented as decision trees, the following simple algorithm shows you get "more power" than in the black-box query model --- each leaf of the tree corresponds to an output. Sort the leaves in decreasing order of output, then loop through them and check whether there exists an input which satisfies all predicates on the internal nodes along the path to that leaf. The first leaf for which such an input exists gives the maximum value; see the sketch below.
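As a hedged sketch of that leaf-sorting idea (my own encoding, not a standard formalism), represent the question's $f$ as a depth-two tree whose two leaves carry their attainable values together with a feasibility check for the path constraints:

```python
from math import pi

SPIKE = 1 / pi

# Each leaf records (best value attainable on that branch, feasibility check:
# does some x in [0, 1] satisfy every predicate on the path to the leaf?).
leaves = [
    # branch "x == 1/pi": constant leaf 10; feasible iff 1/pi lies in [0, 1]
    (10.0, lambda: 0.0 <= SPIKE <= 1.0),
    # branch "x != 1/pi": leaf |x - 1/pi|; its sup on [0, 1] is 1 - 1/pi,
    # attained at x = 1 (which does satisfy x != 1/pi)
    (1.0 - SPIKE, lambda: True),
]

# Sort leaf values in decreasing order; the first feasible one is the max.
for value, feasible in sorted(leaves, key=lambda leaf: -leaf[0]):
    if feasible():
        print("max of f over [0, 1]:", value)   # -> 10.0
        break
```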

As for why this is "more power", technically the amount of power depends on the complexity of the underlying decision tree. But for your function things reduce to analyzing a binary tree of depth two (which is quite simple). How long the black-box model takes depends on how exactly you discretize things, but if you wanted to "query all floating point numbers" (for example) it suffices to say it would take a while.

Note that for more general notions of computation (say, Turing machines) little is possible, for the "standard reasons" (the Turing machines may not even halt). That being said, if you know the function is "nice enough", you could imagine doing something similar to what I described for decision trees, applied to the control flow graph of your function. A decision tree's control flow graph is especially simple --- it's a tree (instead of a general graph), and there is no mutation of variables.

There may be thorny details when one steps away from the discretized world I just discussed, but I (unfortunately) cannot comment on those.
