I think there is some confusion in your question. Once that is settled, everything should be clearer.
- (Minor) Be careful not to confuse continuous-time dynamical systems with discrete-time dynamical systems. Their functional forms look similar, but the meaning of the terms is quite different. In $\dot{h} = f(h)$, the $f$ represents the derivative of the system. The actual trajectory has a different expression (the integral of that function), which we can write as $h(t) = F(h(0), t)$. This is valid for any $t$, so we can also write $h(t+\tau) = F(h(t), \tau)$. A discrete-time dynamical system is essentially an approximation of this, where you fix a $\tau$ as your unit time step.
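As a toy illustration of this relationship (my own sketch, not from the question): fixing a step $\tau$ and applying the forward-Euler rule $h_{t+1} = h_t + \tau f(h_t)$ turns the continuous-time system into a discrete-time one that approximates the flow $F$:

```python
import numpy as np

def f(h):
    # toy vector field: linear decay toward the fixed point at the origin
    return -0.5 * h

def step(h, tau):
    # one step of the discrete-time approximation of the flow F(h, tau)
    return h + tau * f(h)

h = np.array([1.0, -2.0])
tau = 0.1
for _ in range(100):
    h = step(h, tau)
# the discrete iterates track the true solution h(t) = h(0) * exp(-0.5 t),
# which decays to 0; after 100 steps the state is essentially at the fixed point
print(np.linalg.norm(h))
```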
- Addressing now your question: we start by considering where that definition using the Jacobian comes from. From Wikipedia:
> Many parts of the qualitative theory of differential equations and dynamical systems deal with asymptotic properties of solutions and the trajectories—what happens with the system after a long period of time. The simplest kind of behavior is exhibited by equilibrium points, or fixed points, and by periodic orbits. If a particular orbit is well understood, it is natural to ask next whether a small change in the initial condition will lead to similar behavior. Stability theory addresses the following questions: Will a nearby orbit indefinitely stay close to a given orbit? Will it converge to the given orbit? In the former case, the orbit is called stable; in the latter case, it is called asymptotically stable and the given orbit is said to be attracting.
> An equilibrium solution $f_e$ to an autonomous system of first order ordinary differential equations is called stable if for every (small) $\epsilon > 0$, there exists a $\delta > 0$ such that every solution $f(t)$ having initial conditions within distance $\delta$, i.e. $\| f(t_0) - f_e \| < \delta$, of the equilibrium remains within distance $\epsilon$, i.e. $\| f(t) - f_e \| < \epsilon$, for all $t \ge t_0$.
(Note that $f$ here refers to what I called $F$ above, i.e., the solution of the system.)
So people have long wondered how to tell whether an orbit is stable, and among many criteria, one of the most used is the linearization of the system around points of the orbit. Again from Wikipedia:
> One of the key ideas in stability theory is that the qualitative behavior of an orbit under perturbations can be analyzed using the linearization of the system near the orbit. In particular, at each equilibrium of a smooth dynamical system with an $n$-dimensional phase space, there is a certain $n \times n$ matrix $A$ whose eigenvalues characterize the behavior of the nearby points (Hartman–Grobman theorem).
This is why the Jacobian helps us determine the stability: we are basically asking, "what happens if I'm not exactly on that point but really close to it?".
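As a concrete (toy) example of this criterion, here is a damped oscillator of my own choosing: the eigenvalues of the Jacobian at its equilibrium $(0, 0)$ all have negative real part, so the equilibrium is asymptotically stable:

```python
import numpy as np

# toy damped oscillator: x' = y, y' = -x - 0.5*y, with equilibrium (0, 0);
# A is the Jacobian of the vector field evaluated at that equilibrium
A = np.array([[0.0, 1.0],
              [-1.0, -0.5]])

eigvals = np.linalg.eigvals(A)
# both eigenvalues have real part -0.25 < 0, so the equilibrium is
# asymptotically stable: nearby points are attracted to it
print(eigvals.real)
```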
- You have probably noticed that this definition of stability is fairly similar to yours, with the difference that in your definition the two trajectories are generic (not equilibria or periodic orbits). But we can show that the notion of stability you are using is again determined by the Jacobian. Let us define
$$ \Delta(t) := y(t) - h(t) $$
as the (small) difference between the two trajectories. This means that $y = \Delta + h$.
We have
$$
\dot\Delta = f(y) - f(h)
$$
And since $\Delta$ is small, we can linearize the function around $h$
$$
f(y) = f(h+\Delta) \approx f(h) + Jf(h) \cdot \Delta
$$
Which implies
$$
\dot\Delta = f(h) + Jf(h) \cdot \Delta - f(h) = Jf(h) \cdot \Delta
$$
This means that (locally in time and space) a neighboring trajectory of $h(t)$ will approach it or not based on the eigenvalues of the Jacobian matrix $Jf(h)$.
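We can check this numerically on a toy system of my own choosing, $f(h) = -\tanh(h)$, whose Jacobian $\mathrm{diag}(-\mathrm{sech}^2(h))$ has negative eigenvalues everywhere: a trajectory started at $h(0) + \Delta(0)$ ends up much closer to $h(t)$ than it started.

```python
import numpy as np

def f(h):
    # toy field whose Jacobian, diag(-sech^2(h)), has negative eigenvalues
    return -np.tanh(h)

def integrate(h0, tau=0.01, steps=2000):
    # crude forward-Euler integration of h' = f(h)
    h = np.array(h0, dtype=float)
    for _ in range(steps):
        h = h + tau * f(h)
    return h

h0 = np.array([0.8, -0.3])
delta0 = 1e-2 * np.ones(2)          # small initial gap Delta(0)
gap_T = np.linalg.norm(integrate(h0 + delta0) - integrate(h0))
# the gap contracts, as predicted by the negative Jacobian eigenvalues
print(gap_T, np.linalg.norm(delta0))
```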
- Unfortunately, all of this only applies to autonomous dynamical systems. When we have input-driven systems (or, in general, non-autonomous systems), things can get really complicated, and, in fact, simple things - like defining what an attracting point is - tend to become difficult. See, for example, this paper.
- The two papers you linked (along with many others) are interested in this kind of stability, as it is linked to the stability of backpropagation. The paper on AntisymmetricRNN explains this really nicely.
- When you say:
> In applications, for example NLP, you usually do not have a value for $h_0$ and set it to 0, so if your RNN is for sentiment analysis for instance and your RNN outputs a sentiment class (happy, bad, etc.), it will purely depend on the input and not on the initial hidden state.
This is clearly correct, and to understand why the notion of stability is still important in your case, imagine that you have an input $u(t)$ and a neural network $\dot{h} = f(h(t), u(t); \theta)$, where $\theta$ is the set of parameters you can tune. What happens when we change the parameters (e.g., go through one training epoch) and obtain a new set $\theta'$?
This gives a new dynamics, namely $\dot{h}' = f(h'(t), u(t); \theta')$, and we may ask ourselves: will the dynamics of this new model stay close to the old one when receiving the same input (in NLP, when reading the same sentence)? This is a really important question, as the state the network is in at the end of the sentence (let's call it $h(T)$) will affect the label assigned to that sentence!
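Here is a small numerical sketch of that question (the network, sizes, and noise scales are all made up for illustration): two copies of the same RNN with slightly different parameters $\theta$ and $\theta'$ read the same input sequence, and we measure how far apart their final states end up:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(h, u, W, U, tau=0.1):
    # Euler-style discrete update of h' = f(h, u; theta) with theta = (W, U)
    return h + tau * np.tanh(W @ h + U @ u)

W = rng.normal(scale=0.3, size=(4, 4))    # theta
U = rng.normal(scale=0.3, size=(4, 2))
W2 = W + 0.01 * rng.normal(size=(4, 4))   # theta' after a small update

inputs = rng.normal(size=(10, 2))         # the same "sentence" for both models
h, h2 = np.zeros(4), np.zeros(4)
for u in inputs:
    h = rnn_step(h, u, W, U)
    h2 = rnn_step(h2, u, W2, U)
# how far h(T) and h'(T) drift apart is what decides whether the label
# assigned to this sentence can change after the parameter update
print(np.linalg.norm(h - h2))
```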
- Practically, to improve learning, what you want is to change the behavior of the network for the signals (sentences) that are misclassified, but keep the same behavior for signals that were correct. This is, of course, really hard to approach with theory, but I hope that you now get the idea.
This is already a long answer, but I feel that I've skipped a lot of the saucy details. I've also been really informal in the math, as I only wanted to deliver the main ideas: things are much more complicated than that!
Best Answer
There are many different types of RNNs with different governing equations. In general, they are written with the update rule $$ (a_t,h_t) = f_r(h_{t-1},x_{t-1}|\theta) $$ for which the rule you state is simply a "special case" with $f_r(h_t,x_t|\theta)=h_t + f(h_t,x_t|\theta)$ and ignoring the output-per-timestep $a_t$. The loop comes from iterating the process via $t=1,\ldots,n$. In other words, by "unfolding" the RNN. The equation you have is very general, since you can write $f(h_t,x_t) = g(h_t,x_t) - h_t$, meaning you're not very limited.
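For concreteness, here is a minimal sketch of that unfolding (the particular $f_r$, a residual-style tanh update with a linear readout, is just an illustrative choice):

```python
import numpy as np

def f_r(h_prev, x_prev, theta):
    # one step (a_t, h_t) = f_r(h_{t-1}, x_{t-1} | theta); here f_r takes the
    # "special case" form h + f(h, x | theta), with a linear readout for a_t
    W_h, W_x, W_a = theta
    h = h_prev + np.tanh(W_h @ h_prev + W_x @ x_prev)
    a = W_a @ h
    return a, h

rng = np.random.default_rng(1)
theta = (rng.normal(scale=0.1, size=(3, 3)),   # W_h
         rng.normal(scale=0.1, size=(3, 2)),   # W_x
         rng.normal(scale=0.1, size=(1, 3)))   # W_a

xs = rng.normal(size=(5, 2))   # the input sequence
h = np.zeros(3)
outputs = []
for x in xs:                   # "unfolding": iterate the rule over t = 1, ..., n
    a, h = f_r(h, x, theta)
    outputs.append(a)
print(len(outputs), h.shape)   # one output per timestep, final hidden state h_n
```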
Also, as you said,
> But for RNNs this is typically the case since $h_t$ is viewed as a "hidden state" that we repeatedly update given the new data at time $t$, denoted $x_t$, and the old hidden state (memory) $h_t$, to get $h_{t+1}$.
For residual networks, we typically do something like $$ x_\text{out} = F_\phi(x_\text{in}) + s_\theta(x_\text{in}) = F_\phi(x_\text{in}) + W_\theta x_\text{in} $$ where $x_\text{in}\in \mathbb{R}^{N_1}$ and $x_\text{out}\in \mathbb{R}^{N_2}$, so that $s_\theta(x) = x$ (or $W_\theta=I$) if $N_1 = N_2$. In other words, when the sizes do not match, the skip connection becomes a linear projection. (See the original ResNet paper for details.)
Let's suppose $N_1 = N_2$ so we have a skip connection rather than a projection. Then rewrite $x_\text{out} = h_{t+1}$, $x_\text{in} = h_{t}$, and $F_\phi(x_\text{in}) =: f(h_t,\theta_t)$ with $\phi=\theta_t$. This gives us $$ x_\text{out} = h_{t+1} = F_\phi(x_\text{in}) + x_\text{in} = f(h_t,\theta_t) + h_{t} $$ as desired. Skip connections between non-consecutive layers can be done by cramming two iterations into one operation, and "renaming" $t$ so that both layers are performed in one time step. Then the skip connection will produce the $h_t$ term and the operations of the intermediate layers can be encapsulated in $f$.
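The renaming above can be made completely literal in code (a toy sketch; $F_\phi$ here is just a random tanh layer): the residual block and the RNN step are the same computation:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(4, 4))   # the layer's parameters phi = theta_t

def F(x):
    # the residual branch F_phi (a toy tanh layer)
    return np.tanh(W @ x)

def resnet_block(x_in):
    # x_out = F_phi(x_in) + x_in  (identity skip, N1 == N2)
    return F(x_in) + x_in

def rnn_step(h_t):
    # h_{t+1} = f(h_t, theta_t) + h_t  with f := F_phi
    return F(h_t) + h_t

x = rng.normal(size=4)
# after the renaming, the two formulations are literally the same computation
print(np.allclose(resnet_block(x), rnn_step(x)))
```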
From what I can tell, your definition is general enough to cover most network types. Basically, you condition the function on all of the outputs of the intermediate layers $x_\ell$, yes? If so, then the only thing missing is a layer-specific input. For instance, in many RNNs (e.g., for reinforcement learning), we not only pass along and update a hidden state $h_t$, but also receive an additional time-specific input $u_t$ (often denoted $x_t$, but I'm trying to avoid confusing myself) and produce an additional time-specific output, say $a_t$ (like performing an action at each timestep in a game).
However, these are special cases of your formula. Note that $x_{k+1} = F_k(x_0,\ldots,x_k)$, since we can merely lump $u_t$ and $a_t$ into $x_t$ and $x_{t+1}$. We cover RNNs by acting in a Markov manner and ignoring older values. We cover (non-consecutive) skip connections by virtue of conditioning on all the previous values.
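A minimal sketch of this reduction (the update itself is a toy choice): a rule of the general form $x_{k+1} = F_k(x_0, \ldots, x_k)$ that only looks at the last state is exactly an RNN step:

```python
import numpy as np

def F_k(xs):
    # general rule x_{k+1} = F_k(x_0, ..., x_k); acting in a Markov manner
    # means ignoring everything except the most recent lumped state xs[-1]
    # (reading xs[-2] as well would give a non-consecutive skip connection)
    x_k = xs[-1]
    return x_k + np.tanh(x_k)   # toy residual-style update

xs = [np.array([0.5, -1.0])]    # x_0 lumps together (h_0, u_0)
for k in range(4):
    xs.append(F_k(xs))
print(len(xs))                  # states x_0, ..., x_4
```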