Solved – Residual networks: why does each block learn the residual error with respect to the identity mapping

deep-learning, neural-networks, residual-networks

In an ordinary neural network, a stack of layers learns a mapping $H(x)$ from the input $x$. In a residual network, it is said that learning improves when each block learns only the residual error with respect to the identity mapping, i.e. $F(x) = H(x) - x$. Can anyone explain why this improves learning? And what happens if $H(x)$ deviates significantly from the identity mapping?

Thanks!

Best Answer

Residual learning was motivated by the widespread observation that adding layers to a neural network can sometimes lead to worse training/test error. Here is a figure taken directly from the paper Deep Residual Learning for Image Recognition by He et al.:

[Figure from the paper: training and test error curves in which a 56-layer plain network has higher error than a 20-layer one.]

This phenomenon becomes especially consistent when the network is "deep," say, more than 20 layers. In particular it should be surprising because of the following observation: if we take the (trained) 20-layer network in the figure above and add 36 more layers that literally do nothing, i.e. compute the identity, then the resulting 56-layer classifier should perform at least as well as the 20-layer one. The conclusion is that something about how such networks are optimized by backpropagation prevents this. One simple explanation is that neurons by their very nature use nonlinear activation functions, so when tasked with producing identity (i.e. linear) activations, they struggle and overfit wildly. This is especially true when you add a whopping 36 layers with millions of extra parameters. An analogy, taken from Wikipedia, is using a high-degree polynomial to fit a line:
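To make the thought experiment concrete, here is a minimal sketch in PyTorch (my own illustration, not from the paper; `trained_20_layer` is a tiny stand-in for an already-trained classifier) showing that appending layers which compute the exact identity cannot change a trained model's outputs. The catch, of course, is that real layers with nonlinear activations have to learn that identity.

```python
import torch
import torch.nn as nn

# `trained_20_layer` is a stand-in for an already-trained 20-layer classifier.
trained_20_layer = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# 36 extra layers that literally do nothing (the idealized thought experiment).
extra_layers = nn.Sequential(*[nn.Identity() for _ in range(36)])
deeper_model = nn.Sequential(trained_20_layer, extra_layers)

x = torch.randn(8, 32)
# The deeper model's outputs are identical, so its error can be no worse.
assert torch.allclose(trained_20_layer(x), deeper_model(x))
```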

[Figure from Wikipedia: a high-degree polynomial overfitting roughly linear data.]

To me this also suggests that the majority of the important feature extraction happens in the first 20 layers, which means it becomes increasingly difficult to extract extra accuracy if those first 20 layers are not tuned well enough. So perhaps in the 56-layer network above, the first 20 layers end up strongly resembling the 20-layer network (I'm not sure whether rigorous studies have ever been done on this).

Following He's presentation, we start with an abstraction of a neural network as a stack of layers, grouped in pairs:

[Figure: a pair of layers mapping the input $x$ to the desired output $H(x)$.]

Here $x$ is the input and $H(x)$ is the desired output. We then denote the output of the first layer of each pair as $F(x)$ and change the objective: instead of fitting $H(x)$ directly, the layer fits $F(x)$ and the block's output is defined as $H(x) = F(x) + x$:

[Figure: the same pair of layers with a skip connection added, so the block outputs $F(x) + x$.]

Before moving on, let's review. The objective is to produce $H(x)$ from $x$. We have an intermediate output $F(x)$. In classical deep learning, $F(x)$ would be trained directly to minimize the loss on $H(x)$. In residual learning, we instead require that $H(x) = F(x) + x$, or equivalently that $F(x) := H(x) - x$. This is a good idea for the following reasons:
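As a concrete illustration, here is a minimal residual block sketched in PyTorch. The actual blocks in the paper use convolutions and batch normalization; the linear layers here are just to show the structure: the branch computes $F(x)$ and the block returns $F(x) + x$.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # The residual branch computes F(x); linear layers here are illustrative.
        self.residual_branch = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # H(x) = F(x) + x: the skip connection adds the input back in.
        return self.residual_branch(x) + x

block = ResidualBlock(64)
x = torch.randn(8, 64)
h = block(x)      # H(x)
f = h - x         # the residual F(x) = H(x) - x that the branch had to learn
```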

1) Suppose that we want $H(x) = x$. Then by setting all of the weights in the first layer to 0, we get $F(x) = 0$ and, by construction, $H(x) = x$. This makes it trivial for the neural network to reproduce the identity in a cost-effective way that doesn't scream "overfitting." Note that this is not forced on the network, but left as an option. The main example here is that if you take something like a pretrained VGG and stack residual blocks on top of it, then initially those blocks will very nearly reproduce the identity (see the sketch after this list).

2) On the other hand, suppose $H(x)$ is close to the identity (but not equal to it), so that $H(x) - x$ has meaningful fluctuations around 0. Then $F(x)$ can easily capture these fluctuations, because that is exactly what neural networks are good at: capturing non-trivial fluctuations. Continuing the example of stacking residual blocks onto VGG, the blocks will gradually drift away from the identity and extract more accuracy.
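Here is a small sketch of point 1), again in PyTorch, with a toy `backbone` standing in for something like pretrained VGG. One way (among others) to get $F(x) = 0$ is to zero-initialize the last layer of each residual branch, so the stacked blocks start out as the exact identity; training can then let each $F(x)$ drift away from 0 to capture the fluctuations described in point 2).

```python
import torch
import torch.nn as nn

class ZeroInitResidualBlock(nn.Module):
    """Residual block whose branch F(x) is exactly 0 at initialization."""
    def __init__(self, dim):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )
        # Zero only the final layer of the branch: F(x) = 0, so H(x) = x.
        nn.init.zeros_(self.branch[2].weight)
        nn.init.zeros_(self.branch[2].bias)

    def forward(self, x):
        return self.branch(x) + x   # H(x) = F(x) + x

# `backbone` is a hypothetical stand-in for pretrained features (e.g. VGG).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
new_blocks = nn.Sequential(*[ZeroInitResidualBlock(64) for _ in range(3)])

x = torch.randn(8, 32)
features = backbone(x)
# At initialization the new blocks are the exact identity ...
assert torch.allclose(new_blocks(features), features)
# ... but the zeroed layers still receive nonzero gradients (their input
# activations are nonzero), so each F(x) can drift away from 0 during training
# and capture the small fluctuations H(x) - x described in point 2).
```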