Solved – What does the dotted line mean in ResNet?

neural networks

I understand the basic idea of a residual neural network (ResNet): copy $a^{[l]}$ to the add operator at layer $[l+2]$, just before the ReLU at layer $[l+2]$.
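Something like this is how I picture it (a rough PyTorch-style sketch of my understanding, not code from the paper; `layer1`/`layer2` just stand for the two weight layers between $[l]$ and $[l+2]$):

```python
import torch.nn.functional as F

# Rough sketch of the basic residual idea (hypothetical helper, not from the paper):
# a[l] is copied forward and added to z[l+2] right before the ReLU at layer l+2.
def residual_step(a_l, layer1, layer2):
    z = layer2(F.relu(layer1(a_l)))   # the two weight layers between l and l+2
    return F.relu(z + a_l)            # skip connection: add a[l] before the final ReLU
```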


This image shows part of ResNet, which seems to use two types of skip connections, represented by a solid line and a dotted line respectively.


I would just like to know which one is the copy-to-$[l+2]$ operation: the solid line or the dotted line?

A post says

The dotted line is there, precisely because there has been a change in the dimension of the input volume (of course a reduction because of the convolution).

What does that mean? Does the dotted line pointed out by the red arrow reduce the dimension from 64 to 128? I can't understand this. Please help.

Here is Figure 2 ("Residual learning: a building block") from the ResNet paper.


Best Answer

It's best to understand the model in terms of individual "residual" blocks that stack up to form the entire architecture. As you have probably noticed, the dotted connections only appear at the few places where the depth increases (the number of channels, not the spatial dimensions). In this case, the first dotted arrow of the network marks the point where the depth is increased from 64 to 128 channels by a 1x1 convolution.
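As a concrete illustration, here is a minimal PyTorch-style sketch of such a block (my own simplified version, not the reference implementation; the layer names and the exact conv/BN ordering are assumptions). When the channel count changes, the shortcut is a 1x1 convolution, which corresponds to the dotted line:

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Simplified two-layer residual block (a sketch, not the reference implementation)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        stride = 2 if in_ch != out_ch else 1      # a channel increase coincides with downsampling
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if in_ch != out_ch:
            # Dotted line: projection shortcut, a 1x1 conv with stride 2 that matches
            # both the increased channel count and the halved spatial size.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # Solid line: plain identity copy of the input.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))     # add the skip connection before the final ReLU
```

The `stride=2` on both the first 3x3 convolution and the 1x1 shortcut is what keeps the spatial sizes of the two branches in sync, so the addition is well defined.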

Consider equation (2) of the ResNet paper:
$$ y = F(\textbf{x}, \{W_i\}) + W_s \textbf{x} $$
This is used when the dimensions of the mapping $F$ and the identity $\textbf{x}$ do not match, and the mismatch is resolved by introducing a linear projection $W_s$. In particular, as described on page 4 of the ResNet paper, the projection approach performs 1x1 convolutions that leave the spatial dimensions unchanged while increasing or decreasing the number of channels (thereby changing the depth); in ResNet's dotted shortcuts this 1x1 convolution is applied with a stride of 2, so the spatial size is also halved to match the main path. See more about 1x1 convolutions and their use here.

However, another method of matching the dimensions without adding parameters across the skip connection is the padding approach: the input is first downsampled by 1x1 pooling with a stride of 2 and then padded with zero channels to increase the depth.
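For the padding approach, a parameter-free shortcut could be sketched roughly like this (a hypothetical helper, assuming PyTorch and channels-first tensors):

```python
import torch.nn.functional as F

def padding_shortcut(x, out_ch):
    """Option-A style shortcut (sketch): no learnable parameters.

    Downsamples spatially with stride 2, then zero-pads the channel
    dimension so the result can be added to the deeper feature map.
    """
    x = F.avg_pool2d(x, kernel_size=1, stride=2)   # "1x1 pooling" with stride 2: halves H and W
    pad = out_ch - x.size(1)                       # extra channels needed, e.g. 128 - 64 = 64
    return F.pad(x, (0, 0, 0, 0, 0, pad))          # append zero-filled channels
```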

Here is precisely what the paper says:

When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions).
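Using the hypothetical `ResidualBlock` and `padding_shortcut` sketches above, both options turn a 56x56x64 input into something that can be added to a 28x28x128 feature map:

```python
import torch

x = torch.randn(1, 64, 56, 56)            # feature map just before the first dotted shortcut

block = ResidualBlock(64, 128)            # option (B): learned 1x1 projection inside the block
print(block(x).shape)                     # torch.Size([1, 128, 28, 28])

print(padding_shortcut(x, 128).shape)     # option (A): parameter-free, same output shape
```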

Here are some more references, in case needed: a Reddit thread and another SE question along similar lines.