There are two great recent articles on some of the geometric properties of deep neural networks with piecewise linear nonlinearities (which would include the ReLU activation):
- On the Number of Linear Regions of Deep Neural Networks by Montufar, Pascanu, Cho and Bengio.
- On the number of response regions of deep feed forward networks with piece-wise linear activations by Pascanu, Montufar and Bengio.
They provide some badly needed theory and rigor when it comes to neural networks.
Their analysis centers around the idea that:
deep networks are able to separate their input space into exponentially more linear response regions than their shallow counterparts, despite using the same number of computational units.
Thus we may interpret deep neural networks with piecewise linear activations as partitioning the input space into a bunch of regions, and over each region is some linear hypersurface.
In the graphic you have referenced, notice that the various (x,y)-regions have linear hypersurfaces over them (seemingly either slanted planes or flat planes). So we see the hypothesis from the above two articles in action in your referenced graphics.
Furthermore they state (emphasis from the co-authors):
deep networks are able to identify an exponential number of input neighborhoods by mapping them to a common output of some intermediary hidden layer. The computations carried out on the activations of this intermediary layer are replicated many times, once in each of the identified neighborhoods. This allows the networks to compute very complex looking functions even when they are defined with relatively few parameters.
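As a minimal sketch of this folding mechanism (a toy construction, not code from either paper), the hypothetical `tent_layer` below uses two ReLU units to fold the interval $[0, 1]$ onto itself. Stacking it `depth` times gives a network with `2 * depth` hidden units whose number of linear pieces doubles with every layer. A single hidden layer with the same number of ReLU units on a one-dimensional input can produce at most `2 * depth + 1` pieces.

```python
def relu(z):
    return max(z, 0.0)


def tent_layer(x):
    # Two ReLU units followed by a linear readout.
    # On [0, 1] this equals 2x for x <= 0.5 and 2 - 2x otherwise,
    # so both halves of the interval are mapped onto the same output range.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_tent(x, depth):
    # Compose the same two-unit layer `depth` times.
    for _ in range(depth):
        x = tent_layer(x)
    return x


def count_linear_pieces(depth, grid_size=1024):
    # Evaluate on a dyadic grid so every breakpoint falls on a grid point,
    # then count how often the slope changes between neighbouring cells.
    xs = [i / grid_size for i in range(grid_size + 1)]
    ys = [deep_tent(x, depth) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * grid_size for i in range(grid_size)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1


for depth in range(1, 5):
    hidden_units = 2 * depth
    print(depth, hidden_units, count_linear_pieces(depth))
```

Each layer sends the two halves of its input interval to a common output, so whatever the later layers compute is replicated on both halves. That is the one-dimensional version of the replication described in the quote.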
Basically, this is the mechanism that allows deep networks to have incredibly robust and diverse feature representations despite having fewer parameters than their shallow counterparts. In particular, deep neural networks can learn a number of these linear regions that is exponential in their depth. Take, for example, Theorem 8 from the first referenced paper, which states:
Theorem 8: A maxout network with $L$ layers of width $n_0$ and rank $k$ can compute functions with at least $k^{L-1}k^{n_0}$ linear regions.
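To get a feel for how fast this bound grows, here is a small sketch that simply evaluates the expression in Theorem 8. The function name `maxout_region_lower_bound` and the example values of $L$, $n_0$ and $k$ are illustrative choices, not something taken from the paper.

```python
def maxout_region_lower_bound(num_layers, width, rank):
    """Evaluate k^(L-1) * k^(n_0), the lower bound stated in Theorem 8."""
    return rank ** (num_layers - 1) * rank ** width


# Fix the width and rank, and vary only the depth L.
for num_layers in (2, 4, 8, 16):
    bound = maxout_region_lower_bound(num_layers, width=2, rank=2)
    print(num_layers, bound)
```

With the width and rank held fixed, every extra layer multiplies the bound by $k$, which is where the exponential dependence on depth comes from.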
This is again for deep neural networks with piecewise linear activations, such as ReLUs. If you used sigmoid-like activations, you would get smoothly curved hypersurfaces instead. A lot of researchers now use ReLUs or some variation of ReLUs (leaky ReLUs, PReLUs, ELUs, RReLUs, the list goes on) because their linear behavior for positive inputs allows for better gradient backpropagation than sigmoidal units, which can saturate (have very flat, asymptotic regions) and effectively kill gradients.
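The saturation argument can be checked numerically. The sketch below compares the derivative of the logistic sigmoid with that of a ReLU at a few pre-activation values; the helper names and the chosen inputs are just for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    # Convention assumed here: the derivative at exactly z == 0 is taken as 0.
    return 1.0 if z > 0 else 0.0


for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_grad(z):.2e}  relu'={relu_grad(z):.1f}")
```

The sigmoid derivative peaks at 0.25 and shrinks quickly as the input grows, while the ReLU derivative stays at 1 for every positive input. For negative inputs the ReLU derivative is 0, which is one motivation for the leaky variants listed above.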
This exponentiality result is crucial; without it, the piecewise linearity might not be able to efficiently represent the types of nonlinear functions we must learn when it comes to computer vision or other hard machine learning tasks. However, we do have this exponentiality result, and therefore these deep networks can (in theory) learn all sorts of nonlinearities by approximating them with a huge number of linear regions.
As for your question about the hypersurface: you can absolutely set up a regression problem where your deep net tries to learn the $y = f(x_1, x_2)$ hypersurface. This is tantamount to just using a deep net for regression, and many deep learning packages can do this without any trouble.
If you want to just test your intuition, there are a lot of great deep learning packages available these days: Theano (with Lasagne, nolearn and Keras built on top of it), TensorFlow, and a bunch of others I'm sure I'm leaving out. These deep learning packages will compute the backpropagation for you. However, for a smaller scale problem like the one you mentioned, it really is a good idea to code up the backpropagation yourself, just to do it once, and learn how to gradient check it. But like I said, if you just want to try it out and visualize it, you can get started pretty quickly with these deep learning packages.
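For reference, a hand-written backpropagation with a gradient check might look roughly like the sketch below. It assumes NumPy, a tiny one-hidden-layer ReLU network, a half mean squared error loss, and a made-up target $f(x_1, x_2) = \sin(x_1) + x_2^2$; none of these choices come from the question itself.

```python
import numpy as np


def forward(params, X):
    W1, b1, W2, b2 = params
    Z = X @ W1 + b1            # hidden pre-activations, shape (N, H)
    H = np.maximum(Z, 0.0)     # ReLU
    y_hat = H @ W2 + b2        # predictions, shape (N, 1)
    return Z, H, y_hat


def loss(params, X, y):
    _, _, y_hat = forward(params, X)
    return 0.5 * np.mean((y_hat - y) ** 2)


def backprop(params, X, y):
    W1, b1, W2, b2 = params
    Z, H, y_hat = forward(params, X)
    d_yhat = (y_hat - y) / X.shape[0]   # dL/dy_hat
    dW2 = H.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    dZ = (d_yhat @ W2.T) * (Z > 0)      # ReLU passes gradient only where Z > 0
    dW1 = X.T @ dZ
    db1 = dZ.sum(axis=0)
    return [dW1, db1, dW2, db2]


def numerical_grad(params, X, y, eps=1e-5):
    # Central finite differences, one parameter entry at a time.
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = loss(params, X, y)
            p[idx] = old - eps
            loss_minus = loss(params, X, y)
            p[idx] = old
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads


def relative_error(a_list, n_list):
    a = np.concatenate([g.ravel() for g in a_list])
    n = np.concatenate([g.ravel() for g in n_list])
    return np.linalg.norm(a - n) / (np.linalg.norm(a) + np.linalg.norm(n))


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = (np.sin(X[:, 0]) + X[:, 1] ** 2).reshape(-1, 1)  # hypothetical target
params = [rng.normal(size=(2, 4)), rng.normal(size=4),
          rng.normal(size=(4, 1)), rng.normal(size=1)]

err = relative_error(backprop(params, X, y), numerical_grad(params, X, y))
print("relative error between analytic and numerical gradients:", err)
```

A very small relative error suggests the analytic gradients are right. Because ReLU has a kink at zero, a check like this can occasionally report a mismatch when a pre-activation sits extremely close to zero, so a single outlier is not necessarily a bug.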
If one is able to properly train the network (using enough data points, initializing it properly, and having training go well, which is frankly its own whole other issue), then one way to visualize what the network has learned, in this case a hypersurface, is to evaluate it over an xy-mesh or grid and plot the result.
If the above intuition is correct, then a deep net with ReLUs will have learned an exponential number of regions, each with its own linear hypersurface. Of course, the whole point is that because there are exponentially many of them, the linear approximation can become so fine that we do not perceive the jaggedness of it all, provided we used a deep/large enough network.
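A rough sketch of that visualization step, assuming NumPy and using a small network with random (untrained) weights as a stand-in for a trained one, could look like this. The layer sizes, the grid range and the helper `forward_with_pattern` are all hypothetical. Besides the surface values, the code records which ReLU units are active at each grid point, since the network is affine wherever that on/off pattern stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in network: 2 inputs -> 8 -> 8 -> 1, random weights instead of trained ones.
sizes = [2, 8, 8, 1]
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]


def forward_with_pattern(points):
    """Return network outputs and the on/off pattern of every ReLU unit."""
    a = points
    patterns = []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = a @ W + b
        patterns.append(z > 0)
        a = np.maximum(z, 0.0)
    out = a @ weights[-1] + biases[-1]
    return out[:, 0], np.concatenate(patterns, axis=1)


# Build the xy-mesh and evaluate the network at every grid point.
xs = np.linspace(-2.0, 2.0, 200)
ys = np.linspace(-2.0, 2.0, 200)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])

values, patterns = forward_with_pattern(grid)
surface = values.reshape(xx.shape)   # heights of the hypersurface over the mesh

# Count the distinct activation patterns that the grid points fall into.
num_patterns = len({row.tobytes() for row in patterns})
print("grid shape:", surface.shape)
print("distinct activation patterns hit by the grid:", num_patterns)
```

The arrays `xx`, `yy` and `surface` can then be handed to a plotting routine such as Matplotlib's `plot_surface` to see the individual flat facets. The pattern count only covers regions that some grid point lands in, so it is a lower estimate of the number of linear regions inside the plotted square.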
Best Answer
Let's denote by $f$ the true underlying function and by $\hat f$ the function that your machine learning algorithm converges to ($\hat f$ belongs to a family of parametrized functions $F$). For simplicity, let's also assume that $f$ can be expressed analytically and that $f$ is deterministic.
I assume that by "practice", you mean with machine learning (using experimental data) and by "theory", you mean modelling mathematically without machine learning (without data).
In practice, if you have enough data and if $F$ contains $f$, then it should be possible to obtain $\hat f = f$ with an appropriate machine learning methodology.
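Here is a minimal sketch of that claim, assuming NumPy and a deliberately easy setup: the "true" $f$ is a quadratic picked for illustration, $F$ is the family of polynomials of degree at most 2, and the data are noise-free.

```python
import numpy as np


def f(x):
    # Hypothetical true function, chosen so that it lies inside the family F.
    return 2.0 * x**2 - 3.0 * x + 1.0


x = np.linspace(-1.0, 1.0, 20)
y = f(x)                              # noise-free observations of f

# Least-squares fit over F = polynomials of degree <= 2.
coeffs = np.polyfit(x, y, deg=2)      # coefficients, highest degree first
print(coeffs)
```

Up to floating-point error, the fitted coefficients match those of $f$. With noisy data, or with a family $F$ that does not contain $f$, the fit would only approximate $f$.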
Theoretically, you may try to model $f$ with physical laws (or other modelling laws). For example, if $f(p,s)$ models the time it takes for an object of shape $s$ and weight $p$ to fall from the top of the Eiffel tower, you can use classical mechanics (assuming it holds in the scope/scale of $f$) to model $f$.
For apples and oranges, $f$ is subjective to a particular person (given an ambiguous picture, two people may disagree). So let's consider your $f$. $f$ is then defined by your brain! So if we assume that there exists an analytical expression of $f$, here are the two ways to find it:
To recap, you can usually find $f$ but it is really hard in both cases: