Definition of a neural network

Tags: artificial intelligence, machine learning, neural networks

I need a definition of neural networks in terms of mathematical mapping syntax. Since neural networks come in many different shapes, I find it hard to come up with a definition that covers all the different types. For feed-forward networks without cycles, loops, or skip connections, a solid definition could be

$ \mathbb{R}^{n_0} \ni x_0 \mapsto x_N \in \mathbb{R}^{n_N}$ where
$x_{k+1} = F_k(x_k)$ and $F_k: \mathbb{R}^{n_k} \to \mathbb{R}^{n_{k+1}}$ for $k = 0, \ldots, N-1$.
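
To make this concrete, here is a minimal NumPy sketch of the composition I have in mind; the layer widths and the tanh nonlinearity are arbitrary illustrative choices of mine, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer widths n_0, ..., n_N (here N = 3).
widths = [4, 8, 6, 2]

# Each F_k: R^{n_k} -> R^{n_{k+1}}, realised as an affine map plus a tanh.
weights = [rng.standard_normal((m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.standard_normal(m) for m in widths[1:]]

def F(k, x):
    """The k-th layer map F_k: R^{n_k} -> R^{n_{k+1}}."""
    return np.tanh(weights[k] @ x + biases[k])

# x_0 |-> x_N via x_{k+1} = F_k(x_k), k = 0, ..., N-1.
x = rng.standard_normal(widths[0])
for k in range(len(widths) - 1):
    x = F(k, x)

print(x.shape)  # (2,) == n_N
```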

In the case where forward skip connections are admissible, I think the definition would change to

$ \mathbb{R}^{n_0} \ni x_0 \mapsto x_N \in \mathbb{R}^{n_N}$ where
$x_{k+1} = F_k(x_0,x_1,\ldots,x_k)$ and $F_k: \mathbb{R}^{n_0 + n_1 + \cdots + n_k} \to \mathbb{R}^{n_{k+1}}$ for $k = 0, \ldots, N-1$.
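
And the corresponding sketch for the skip-connection version, where each $F_k$ receives the concatenation of all earlier layer outputs (again with arbitrary widths and nonlinearity):

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 8, 6, 2]  # n_0, ..., n_N (illustrative)
N = len(widths) - 1

# F_k: R^{n_0 + ... + n_k} -> R^{n_{k+1}}, again an affine map plus a tanh.
in_dims = [sum(widths[: k + 1]) for k in range(N)]
weights = [rng.standard_normal((widths[k + 1], in_dims[k])) for k in range(N)]
biases = [rng.standard_normal(widths[k + 1]) for k in range(N)]

def F(k, history):
    """F_k applied to the concatenation (x_0, x_1, ..., x_k)."""
    return np.tanh(weights[k] @ np.concatenate(history) + biases[k])

xs = [rng.standard_normal(widths[0])]  # xs[k] holds x_k
for k in range(N):
    xs.append(F(k, xs[: k + 1]))       # x_{k+1} = F_k(x_0, ..., x_k)

print(xs[-1].shape)  # (2,) == n_N
```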

In the paper "Neural Ordinary Differential Equations", the authors state:

Models such as residual networks, recurrent neural network decoders, and normalizing flows build complicated transformations by composing a sequence of transformations to a hidden state:
$h_{t+1} = h_t + f(h_t,\theta_t)$

This can obviously only be true for networks in which every layer has the same dimension.

Other papers I have read also claim that residual networks (i.e. networks containing skip connections) can be represented in the above way. My questions are:

  1. How can recurrent neural networks (i.e. networks containing loops and/or cycles) be represented in the way shown in the block quote? Where does the loop appear in the formula?
  2. Also, how can residual networks with skip connections between non-consecutive layers be written in that way? Where do we see the skip connections in the formula?
  3. How would I need to change the second definition I gave above in order to cover most network types?

Best Answer

  1. There are many different types of RNNs with different governing equations. In general, they are written with the update rule $$ (a_t,h_t) = f_r(h_{t-1},x_{t-1}|\theta) $$ for which the rule you state is simply a "special case" with $f_r(h_t,x_t|\theta)=h_t + f(h_t,x_t|\theta)$, ignoring the per-timestep output $a_t$. The loop comes from iterating the process over $t=1,\ldots,n$, i.e. by "unfolding" the RNN. The equation you have is very general, since you can always write $f(h_t,x_t) = g(h_t,x_t) - h_t$, so you are not very limited. (A small numeric sketch of this unrolling is given after this list.)

    Also, as you said,

    This can obviously only be true for networks in which every layer has the same dimension.

    But for RNNs this is typically the case since $h_t$ is viewed as a "hidden state" that we repeatedly update given the new data at time $t$, denoted $x_t$, and the old hidden state (memory) $h_t$, to get $h_{t+1}$.

  2. For residual networks, we typically do something like $$ x_\text{out} = F_\phi(x_\text{in}) + s_\theta(x_\text{in}) = F_\phi(x_\text{in}) + W_\theta x_\text{in} $$ where $x_\text{in}\in \mathbb{R}^{N_1}$ and $x_\text{out}\in \mathbb{R}^{N_2}$, so that $s_\theta(x) = x$ (or $W_\theta=I$) if $N_1 = N_2$. In other words, when the sizes do not match, the skip connection becomes a linear projection (see the original ResNet paper for details).

    Let's suppose $N_1 = N_2$, so we have a skip connection rather than a projection. Then rewrite $x_\text{out} = h_{t+1}$, $x_\text{in} = h_{t}$, and $F_\phi(x_\text{in}) =: f(h_t,\theta_t)$ with $\phi=\theta_t$. This gives us $$ x_\text{out} = h_{t+1} = F_\phi(x_\text{in}) + x_\text{in} = f(h_t,\theta_t) + h_{t} $$ as desired. Skip connections between non-consecutive layers can be handled by folding two layers into one operation and "renaming" $t$ so that both layers are performed in a single time step: the skip connection then produces the $h_t$ term, and the operations of the intermediate layers are encapsulated in $f$. (This is also sketched in code after this list.)

  3. From what I can tell, your definition is general enough to cover most network types. Essentially, you condition the function on the outputs of all intermediate layers $x_\ell$, yes? If so, then the only thing missing is a layer-specific input. For instance, in many RNNs (e.g., for reinforcement learning), we not only pass along and update a hidden state $h_t$, but we also receive an additional time-specific input $u_t$ (often denoted $x_t$, but I'm trying to avoid confusing myself) and produce an additional time-specific output, say $a_t$ (like performing an action at each timestep in a game).

    However, these are special cases of your formula $x_{k+1} = F_k(x_0,\ldots,x_k)$, since we can merely lump $u_t$ and $a_t$ into $x_t$ and $x_{t+1}$. We cover RNNs by acting in a Markov manner and ignoring older values, and we cover (non-consecutive) skip connections by virtue of conditioning on all previous values.
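
To make point 1 concrete, here is a toy NumPy unrolling of the generic RNN update and of the residual-style special case; the tanh cell, the dimensions, and the weight matrices are illustrative assumptions of mine, not a claim about any particular architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, d_a, T = 5, 3, 2, 4  # hidden, input, output sizes; number of timesteps

W_h = rng.standard_normal((d_h, d_h))
W_x = rng.standard_normal((d_h, d_x))
W_a = rng.standard_normal((d_a, d_h))

def f_r(h_prev, x_t):
    """Generic RNN step: (a_t, h_t) = f_r(h_{t-1}, x_t | theta)."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t)
    a_t = W_a @ h_t
    return a_t, h_t

def f_res(h_prev, x_t):
    """Residual-style special case: h_t = h_{t-1} + f(h_{t-1}, x_t | theta)."""
    return h_prev + np.tanh(W_h @ h_prev + W_x @ x_t)

# "Unfolding" the loop over t = 1, ..., T: the same f_r is applied repeatedly.
h = np.zeros(d_h)
for x_t in rng.standard_normal((T, d_x)):
    a_t, h = f_r(h, x_t)

# The residual variant keeps the hidden dimension fixed, as required for the sum.
h_res = np.zeros(d_h)
for x_t in rng.standard_normal((T, d_x)):
    h_res = f_res(h_res, x_t)
```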
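And for point 2, a sketch of a residual block in which the skip is the identity when the sizes match and a linear projection $W_\theta$ when they do not, with a skip over two layers handled by folding both layers into a single $f$ (all names and shapes here are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x_in, F, W_skip=None):
    """x_out = F(x_in) + s(x_in), where s is the identity or a linear projection."""
    skip = x_in if W_skip is None else W_skip @ x_in
    return F(x_in) + skip

# Matching sizes: identity skip over *two* layers folded into a single f,
# so the update still reads h_{t+1} = h_t + f(h_t, theta_t).
N1 = 6
W1 = rng.standard_normal((N1, N1))
W2 = rng.standard_normal((N1, N1))
two_layer_f = lambda h: W2 @ np.tanh(W1 @ h)

h = rng.standard_normal(N1)
h_next = residual_block(h, two_layer_f)  # h_{t+1} = h_t + f(h_t)

# Mismatched sizes: the skip becomes a projection W_theta instead of the identity.
M1, M2 = 6, 4
W_f = rng.standard_normal((M2, M1))
W_theta = rng.standard_normal((M2, M1))
y = residual_block(h, lambda x: np.tanh(W_f @ x), W_skip=W_theta)
```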
