You do indeed have things backwards. Arbitrary width refers to a neural net with fixed input and output layers and an intermediate layer that can be made as large as necessary. Arbitrary depth refers to a neural net with some fixed upper bound on the number of neurons per layer but arbitrarily many layers.
George Cybenko proved the first version of the arbitrary width theorem, for sigmoid activation functions, in 1989 (from the Wikipedia page). The current version of the arbitrary width theorem is valid for any non-polynomial activation function, and essentially states that any continuous function on a compact domain can be approximated to an arbitrary degree of precision by some three-layer neural net (i.e., one hidden layer) with a sufficiently large intermediate layer.
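To make the arbitrary-width idea concrete, here is a minimal sketch (the construction and the function names are my own illustration, not from Cybenko's paper): a one-hidden-layer ReLU network can exactly reproduce the piecewise-linear interpolant of any continuous function on $[0,1]$, so widening the hidden layer drives the approximation error to zero.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def width_approximator(f, n_hidden, x):
    """One-hidden-layer ReLU net whose output is the piecewise-linear
    interpolant of f at n_hidden + 1 equally spaced knots on [0, 1].
    Widening the hidden layer (more knots) shrinks the error."""
    knots = np.linspace(0.0, 1.0, n_hidden + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)    # slope on each interval
    coeffs = np.diff(slopes, prepend=0.0)      # c_i = s_i - s_{i-1}
    # output layer: affine combination of the hidden ReLU units
    return vals[0] + relu(x[:, None] - knots[:-1][None, :]) @ coeffs

x = np.linspace(0.0, 1.0, 1000)
target = lambda t: np.sin(2 * np.pi * t)
err_narrow = np.max(np.abs(width_approximator(target, 10, x) - target(x)))
err_wide = np.max(np.abs(width_approximator(target, 200, x) - target(x)))
```

Of course, the theorem is about existence of approximators for general activation functions; this sketch just shows one explicit family where "wider means closer" is easy to verify.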
The major results for the arbitrary depth case are much more recent, and seem to have been mostly proven in the late 2010s. There are a couple of variants: one for the ReLU activation function (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) with a good bound on the width, and one for more general activation functions with a weaker bound on the width. They both say, essentially, that given input and output layers of fixed size, there is some bound on width $B$ such that any continuous function can be approximated by a neural net with at most $B$ neurons in each layer and sufficiently many layers.
When one set, say $X$, is a dense subset of another, say $Y$, it means that any element of $Y$ can be approximated arbitrarily closely by some element of $X$, not that all elements of $X$ are close to all elements of $Y$. In our case, each of the universal approximation theorems states that a certain space of functions generated by neural nets (either arbitrary width with bounded depth, or bounded width with arbitrary depth) is dense in the space of continuous functions from, let's say, $\mathbb{R}^n$ to $\mathbb{R}^m$.
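To state the density claim precisely (this is the standard definition, with the metric spelled out): a set $X$ is dense in a metric space $(Y, d)$ when

$$
\forall y \in Y,\ \forall \epsilon > 0,\ \exists x \in X \ \text{such that}\ d(x, y) < \epsilon.
$$

For the universal approximation theorems, $Y = C(K, \mathbb{R}^m)$ for a compact $K \subset \mathbb{R}^n$, the metric is $d(f, g) = \sup_{x \in K} \|f(x) - g(x)\|$, and $X$ is the set of functions realized by the relevant family of networks.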
I think that there is some confusion in your question. Once we have that settled, everything will presumably be more clear.
- (Minor) Be careful not to confuse continuous-time dynamical systems with discrete-time dynamical systems. Their functional form is similar, but the meaning of the terms is quite different. When you have $\dot{h} = f(h)$, that $f$ represents the derivative of the system. The actual trajectory will have a different expression (the integral of that function), which we can write as $x(t) = F(x(0), t)$. This is valid for any $t$, so we can write it as $x(t+\tau) = F(x(t), \tau)$. The discrete-time dynamical system is actually an approximation of this (where you basically fix a $\tau$ as your unit time).
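A quick way to see the relationship is forward Euler discretization; here is a sketch with a simple linear $f$ chosen for illustration (the specific system is my own example):

```python
import math

# Continuous-time system: x' = f(x), here f(x) = -x, whose exact solution
# (what I called F above) is x(t) = x(0) * exp(-t).
f = lambda x: -x
x0, T = 1.0, 2.0
n_steps = 2000
tau = T / n_steps   # tau plays the role of the fixed "unit time"

# Discrete-time counterpart: x_{k+1} = x_k + tau * f(x_k)
x = x0
for _ in range(n_steps):
    x = x + tau * f(x)

exact = x0 * math.exp(-T)
# The discrete trajectory tracks the continuous one up to O(tau) error
```

Shrinking $\tau$ makes the discrete system approximate the continuous trajectory better, which is exactly the sense in which the two functional forms correspond.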
- Addressing now your question, we start by considering where that definition using the Jacobian comes from. From Wikipedia:
> Many parts of the qualitative theory of differential equations and dynamical systems deal with asymptotic properties of solutions and the trajectories—what happens with the system after a long period of time. The simplest kind of behavior is exhibited by equilibrium points, or fixed points, and by periodic orbits. If a particular orbit is well understood, it is natural to ask next whether a small change in the initial condition will lead to similar behavior. Stability theory addresses the following questions: Will a nearby orbit indefinitely stay close to a given orbit? Will it converge to the given orbit? In the former case, the orbit is called stable; in the latter case, it is called asymptotically stable and the given orbit is said to be attracting.
>
> An equilibrium solution $f_e$ to an autonomous system of first order ordinary differential equations is called: stable if for every (small) $\epsilon > 0$, there exists a $\delta > 0$ such that every solution $f(t)$ having initial conditions within distance $\delta$, i.e. $\|f(t_0) - f_e\| < \delta$, of the equilibrium remains within distance $\epsilon$, i.e. $\|f(t) - f_e\| < \epsilon$, for all $t \ge t_0$.
(Note that $f$ here refers to what I've called F above, i.e., the solution of the system).
So people have been wondering how to determine whether an orbit is stable, and among many criteria, one of the most used is the linearization of the system around points of the orbit. Again from Wikipedia:
> One of the key ideas in stability theory is that the qualitative behavior of an orbit under perturbations can be analyzed using the linearization of the system near the orbit. In particular, at each equilibrium of a smooth dynamical system with an n-dimensional phase space, there is a certain n×n matrix A whose eigenvalues characterize the behavior of the nearby points (Hartman–Grobman theorem).
This is why the Jacobian helps us determine the stability: we are basically asking, "what happens if I'm not exactly on that point but really close to it?".
- You have probably noticed that this definition of stability is fairly similar to yours, with the difference that in your definition the two trajectories are generic (not necessarily equilibria or periodic orbits). But we can show that the notion of stability you are using is again determined by the Jacobian. Let us define
$$ \Delta(t) := y(t) - h(t) $$
as the (small) difference between the two trajectories. This means that $y = \Delta + h$.
We have
$$
\dot\Delta = f(y) - f(h)
$$
And since $\Delta$ is small, we can linearize the function around $h$
$$
f(y) = f(h+\Delta) \approx f(h) + Jf(h) \cdot \Delta
$$
Which implies
$$
\dot\Delta = f(h) + Jf(h) \cdot \Delta - f(h) = Jf(h) \cdot \Delta
$$
This means that (locally in time and space) a neighboring trajectory of $h(t)$ will approach it or not depending on the eigenvalues of the Jacobian matrix.
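You can check the derivation above numerically; in this sketch the specific vector field is a toy example of my own, chosen only so that both $f$ and its Jacobian are easy to write down:

```python
import numpy as np

# Toy nonlinear autonomous system x' = f(x) and its Jacobian
def f(x):
    return np.array([-x[0] + x[1] ** 2,
                     -2.0 * x[1] + x[0] * x[1]])

def Jf(x):
    return np.array([[-1.0, 2.0 * x[1]],
                     [x[1], -2.0 + x[0]]])

tau = 1e-3
h = np.array([0.5, 0.3])            # reference trajectory h(t)
y = h + np.array([1e-4, -1e-4])     # nearby trajectory y(t)
delta = y - h                       # Delta(0)

for _ in range(1000):               # integrate up to t = 1 (forward Euler)
    delta = delta + tau * (Jf(h) @ delta)   # linearized: Delta' = Jf(h) Delta
    y = y + tau * f(y)
    h = h + tau * f(h)

gap = np.linalg.norm((y - h) - delta)   # linearization error, O(||Delta||^2)
```

The linearized $\Delta$ tracks the true difference $y(t) - h(t)$ up to a second-order error, which is exactly what the $\approx$ in the derivation claims.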
- Unfortunately, all of this only applies to autonomous dynamical systems. When we have input-driven systems (or, in general, non-autonomous systems), things can get really complicated, and, in fact, simple things - like defining what an attracting point is - tend to become difficult. See, for example, this paper.
- The two papers you linked (along with many others) are interested in this kind of stability because it is linked to the stability of backpropagation. The paper about AntisymmetricRNN explains this really nicely.
- When you say:
> In applications for example NLP you usually do not have a value for h0 and set it to 0 so if your RNN is for sentiment analysis for instance and your RNN output a sentiment class (happy, bad, etc) it will purely depend on the input and not on the initial hidden state.
This is clearly correct, and to understand why the notion of stability is still important in your case, imagine that you have an input $u(t)$ and a neural network $\dot{h} = f(h(t), u(t); \theta)$, where $\theta$ is the set of parameters you can tune. What happens when we change the parameters (e.g., go through one training epoch) and we have a new set $\theta'$?
This will give a new dynamics, namely: $\dot{h'} = f(h'(t),u(t); \theta')$ and we may ask ourselves: will the dynamics of this new model stay close to the old one when receiving the same input (in NLP, "when reading the same sentence?")? This is a really important question, as the state in which the network will be at the end of the sentence (let's call it $h(T)$) will affect the label assigned to that sentence!
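A small simulation makes this concrete; the network form $\dot{h} = -h + \tanh(Wh + Uu)$, the sizes, and the perturbation scale are all invented for illustration. We drive the same input through the model with parameters $\theta = W$ and with a slightly perturbed $\theta' = W'$, then look at the gap in the final state:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # hidden size (illustrative)
W = rng.normal(scale=0.1, size=(n, n))   # parameters theta
W2 = W + 1e-3 * rng.normal(size=(n, n))  # perturbed parameters theta'
U = rng.normal(scale=0.5, size=(n, 1))   # input weights (shared)

def step(h, u, W, tau=0.1):
    # one Euler step of h' = -h + tanh(W h + U u)
    return h + tau * (-h + np.tanh(W @ h + U @ u))

h1 = np.zeros((n, 1))                    # model with theta
h2 = np.zeros((n, 1))                    # model with theta'
for t in range(100):                     # same input "sentence" for both
    u = np.array([[np.sin(0.1 * t)]])
    h1 = step(h1, u, W)
    h2 = step(h2, u, W2)

gap = np.linalg.norm(h1 - h2)            # difference in the final state h(T)
```

Here the gap stays small because the perturbation is tiny and the dynamics are contractive; with less well-behaved dynamics the same small parameter change can push $h(T)$ far away, and hence flip the label.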
- Practically, to improve learning, what you want is to change the behavior of the network for the signals (sentences) that are misclassified, but keep the same behavior for signals that were correct. This is, of course, really hard to approach with theory, but I hope that you now get the idea.
This is already a long answer, but I feel that I've skipped a lot of the saucy details. I've also been really informal in the math, as I only wanted to deliver the main ideas: things are much more complicated than that!
Polynomial regression is usually just the wrong Bayesian prior. You need functions with highly "non-local" effects, which require high-degree polynomials, but polynomial regression gives zero prior probability to high-degree polynomials. As it turns out, neural networks happen to provide a reasonably good prior (perhaps that's why our brains work that way -- if they even do).