Neural Networks – Are There Any Weight Matrices for the Residual Connections in ResNet?

neural-networks, residual-networks, weights

In ResNet and its variants, the architecture uses shortcut connections, such as those shown below (taken from here):
[Figure: ResNet-style block diagrams with shortcut connections]

Do these shortcut connections have any weight matrices (and biases) associated with them, or do they simply copy an output and carry it to another point of the network, where it is summed with the main path?

Best Answer

There are two cases in the ResNet paper.

  1. When the two summands of the shortcut connection have the same shape, the identity mapping is used, so there is no weight matrix.

  2. When the summands would have different shapes, there is a weight matrix whose purpose is to project the shortcut input to the same shape as the output of the residual branch.

From the ResNet paper by Kaiming He et al., "Deep Residual Learning for Image Recognition":

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as: \begin{equation}\label{eq:identity} y= \mathcal{F}(x, \{W_{i}\}) + x. \end{equation} Here $x$ and $y$ are the input and output vectors of the layers considered. The function $\mathcal{F}(x, \{W_{i}\})$ represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, $\mathcal{F}=W_{2}\sigma(W_{1}{x})$ in which $\sigma$ denotes ReLU (Nair 2010) and the biases are omitted for simplifying notations. The operation $\mathcal{F}+{x}$ is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., $\sigma({y}),$ see Fig. 2).
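
To make Eqn. (1) concrete, here is a minimal NumPy sketch of the two-layer identity-shortcut block described above. The function and variable names are my own, not the authors', and biases are omitted as in the quoted passage:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block_identity(x, W1, W2):
    """Two-layer residual block with an identity shortcut, as in Eqn. (1).

    F(x, {W_i}) = W2 @ relu(W1 @ x); the shortcut adds x unchanged,
    and the second ReLU is applied after the addition.
    """
    F = W2 @ relu(W1 @ x)   # residual mapping F(x, {W_i})
    return relu(F + x)      # element-wise addition, then the second nonlinearity

# The shortcut itself contributes no parameters: only W1 and W2 are learned.
rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))
y = residual_block_identity(x, W1, W2)   # y has the same shape as x
```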

The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

The dimensions of ${x}$ and $\mathcal{F}$ must be equal in Eqn. 1. If this is not the case (\eg, when changing the input/output channels), we can perform a linear projection $W_{s}$ by the shortcut connections to match the dimensions: \begin{equation}\label{eq:transform} {y}= \mathcal{F}({x}, \{W_{i}\}) + W_{s}{x}. \end{equation} We can also use a square matrix $W_{s}$ in Eqn.1. But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus $W_{s}$ is only used when matching dimensions.