[Math] How to find the formula of a function from its graph

graphing-functions, interpolation

I've got some data points (X/Y coordinates) that were apparently created using a certain formula, which I now want to reconstruct. I only have those points, and I can plot them.
I know how to measure maxima, minima, slopes, etc., and could solve an equation to compute values for the parameters of a formula so that it closely approximates/interpolates my data points. But which formula should I use?

I somehow need to find a parametric formula that matches the shape of the (parts of the) graph(s). What are the steps to identify such shapes, and to what formula (parts) do they correspond?

A step function is rather easy to identify, but I'm having problems with curves. Are they polynomials, and of what degree? Are they trigonometric functions, like a sine, perhaps squared or cubed? Do they contain a fraction, an exponentiation, something else? Or is it something completely different, like a Bézier curve?

I'm not looking for a solution to a specific problem, but a generic guide, hoping that this is not too broad.

Best Answer

The best way to handle this kind of data-modelling problem is to use information-theoretic approaches.

Kolmogorov complexity

The simplest and completely generic way is to find a (program, input) pair with the shortest total length, over a fixed general-purpose programming language, that when run outputs the data exactly as it is. This is called Kolmogorov complexity. Of course this depends on the chosen programming language, and clearly if you look at the data first you can construct a language purposely tailored to it (see Code Golf SE for some egregious examples). That's not allowed; you must choose the language once and for all, and use it for every subsequent problem whose data you want to measure the complexity of. Also, the language must be Turing-complete. Common languages such as C++14 or Java 7 are suitable. The lengths of the program and the input string are always measured in bits.

Let $K_L(x)$ be the Kolmogorov complexity of string $x$ over Turing-complete programming language $L$.

Why would this work? It can easily be proven that, given Turing-complete languages $L,M$, there is some constant $c$ such that for every string $x$ we have $|K_L(x)-K_M(x)| \le c$. Thus for any class of strings the Kolmogorov complexity is asymptotically language-independent as the string length goes to $\infty$. For example, the Kolmogorov complexity of the string $x_n$ consisting of exactly $n$ zeros differs from $\log_2(n)$ by at most some constant $c$ that is independent of $n$. So as $n \to \infty$ we have $K_L(x_n) \in \Theta(\log n)$.
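To make the zeros example concrete, here is a small sketch in Python: a literal witness program for the string of $n$ zeros has source length that grows with the number of decimal digits of $n$, i.e. like $\log n$. The 8-bits-per-source-character encoding is an assumption for illustration.

```python
# Sketch: a concrete witness program for the string of n zeros.
# Its source length grows with the number of decimal digits of n,
# i.e. like log(n); 8 bits per source character is an assumed encoding.
for n in (10, 10**3, 10**6, 10**9):
    program = f'print("0" * {n})'       # a program that outputs exactly n zeros
    print(n, len(program) * 8, "bits")  # an upper bound on K(string of n zeros)
```

Each thousandfold increase in $n$ adds only three characters to the program, matching the logarithmic growth claimed above.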

Since the choice of language does not affect asymptotic Kolmogorov complexity, it is an objective measure of complexity for sufficiently large samples, and fixing a common language like C/C++ or Java leaves no room for over-fitting by tailoring the language to the data.

Suitability of Kolmogorov complexity

For exact data, Kolmogorov complexity captures precisely the amount of information contained within it (up to a constant, as mentioned above). However, it can be proven that there is no Turing-complete language $L$ and program $P$ such that $P(x) = K_L(x)$ for every string $x$. So you cannot hope to determine Kolmogorov complexity in general, let alone find the (program, input) pair that witnesses it!

What you can still do is find an upper bound, which is essentially what compression algorithms do. Given any string $x$, the total length of the compressed form of $x$ plus the decompression algorithm (written in $L$) is an upper bound on the true Kolmogorov complexity $K_L(x)$.
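A quick Python sketch of this idea, with `zlib` standing in for an arbitrary compressor written in $L$ (the decompressor's own length is ignored here, since it is a constant):

```python
import os
import zlib

# A highly regular string: one million '0' bytes. Its true Kolmogorov
# complexity is tiny (roughly log2(10**6) bits plus a language-dependent
# constant), and a generic compressor already exposes much of that regularity.
regular = b"0" * 1_000_000
compressed = zlib.compress(regular, level=9)
print(len(regular), "->", len(compressed), "bytes")

# A random string is incompressible with overwhelming probability:
# compression cannot shrink it below (roughly) its original length.
random_data = os.urandom(1_000_000)
print(len(random_data), "->", len(zlib.compress(random_data, level=9)), "bytes")
```

The compressed length of the regular string is a valid (if loose) upper bound on its Kolmogorov complexity, while the random string illustrates that most strings have no shorter description than themselves.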

Nevertheless, most scientific experiments do not produce exact data, so it is pointless to ask for a program that reproduces the measurements exactly; they are already imprecise and inaccurate. But the same ideas still apply. We just need to account for noise.

Kolmogorov complexity modulo noise

If the data is a list of pairs $(x_k,y_k)_{k\in\{1..n\}}$, as in many empirical experiments, and we know that the noise is uniform across the list, we can model the data as $N(P(x_k))=y_k$, where the program $P$ is the actual function we seek and $N$ is a random program modelling a noisy channel that adds the noise. $N$ is written in a Turing-complete language that is additionally allowed to call a function returning a uniformly random bit (independent of the results of previous calls).

Then we want to find a pair $(P,N)$ that minimizes what I call the complexity modulo noise, defined as $|(P,N)| - \frac{1}{n} \sum_{k=1}^n \log_2(\mathbb{P}(N(P(x_k))=y_k))$. Intuitively, this is the length of the description of the trend and the noise source, plus the average information in the noise itself. This way, if there is no meaningful trend mapping $x$ to $y$, the complexity modulo noise is just the number of bits required to specify the noise. On the other hand, if there is a trend, it is captured by a short $P$, and there is an optimal $N$, since it must be noisy enough to account for the discrepancies but not more than necessary. Also, $N$ captures the actual noise distribution if the sample is large enough.

Again, it is not possible to compute $P,N$ deterministically, but we can find upper bounds. Furthermore, this provides a way to compare approximations of an unknown function given a noisy sample of points. If we have two polynomials, one of low degree with simple coefficients that fits closely enough, and the other of very high degree with complicated coefficients but passing through all the points at the sample resolution, you can be sure that the low-degree polynomial has the lower complexity modulo noise. Likewise, you can try as many approximations as you like and simply choose the one with the lowest complexity modulo noise.
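The polynomial comparison can be sketched in Python. The numbers here are assumptions for illustration: a crude 32-bit description cost per real-valued parameter, a measurement resolution $\delta$ for discretizing the Gaussian density, and a deterministic stand-in for the noise. The degree-11 interpolant through 12 points exists and is unique, so it is represented only by its (zero) residuals rather than computed explicitly.

```python
import math

# Hypothetical noisy sample from the "unknown" trend y = 2x^2 - 3x + 1.
xs = list(range(12))
noise = [((37 * k + 11) % 17) / 17.0 - 0.5 for k in range(12)]  # stand-in noise
ys = [2 * x * x - 3 * x + 1 + e for x, e in zip(xs, noise)]

DELTA = 0.001        # assumed measurement resolution: y known only to +/- DELTA
BITS_PER_PARAM = 32  # assumed description cost of one real-valued parameter

def noise_bits(residual, sigma):
    """-log2 P(observation | model): Gaussian density times one bin of width DELTA."""
    pdf = math.exp(-0.5 * (residual / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(DELTA * pdf)

def score(residuals, n_params, sigma):
    """|(P, N)| plus the average information in the noise, as in the answer."""
    avg = sum(noise_bits(r, sigma) for r in residuals) / len(residuals)
    return BITS_PER_PARAM * n_params + avg

# Model A: the simple quadratic trend plus a Gaussian channel (3 coeffs + sigma).
res_a = [y - (2 * x * x - 3 * x + 1) for x, y in zip(xs, ys)]
score_a = score(res_a, n_params=4, sigma=0.4)

# Model B: the unique degree-11 polynomial through all 12 points, so its
# residuals are exactly zero, with a near-noiseless channel (12 coeffs + sigma).
score_b = score([0.0] * 12, n_params=13, sigma=DELTA)

print(f"simple quadratic:   {score_a:.1f} bits")
print(f"interpolating poly: {score_b:.1f} bits")
```

Despite fitting the sample perfectly, the interpolating polynomial pays far more for its description than it saves in noise bits, so the low-degree model wins, as the argument above predicts.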

This also lets you use your intuition to guide your attempts. Real-world measurements tend to suffer from Gaussian noise, so you can try taking $N$ to be a normal random variable with various variances (the mean is already captured by $P$).

It should be practically impossible to over-fit using this method, which is the crucial property you are looking for.
