Solved – Why do residual networks work

conv-neural-network, machine-learning, neural-networks, residual-networks

I have a few questions about the paper Deep Residual Learning for Image Recognition by
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.

The building block of a residual network can be viewed as follows: the data is passed through the right branch (convolution, scale layer, convolution) and through the left branch (an identity mapping or a convolution); the outputs of the two branches are then summed.
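For concreteness, here is a minimal NumPy sketch of that block structure (my own illustration, not the paper's exact architecture; the function F stands in for the learned convolution/scale/convolution branch):

    import numpy as np

    def residual_block(x, F):
        # F is the learned branch (conv -> scale -> conv in the paper's block);
        # the other branch is the identity shortcut, and the two are summed:
        # y = F(x) + x
        return F(x) + x

    # toy example: F is a small linear map that keeps the shape of x
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((4, 4))
    x = rng.standard_normal(4)
    y = residual_block(x, lambda v: W @ v)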

  1. Why does this allow training deep networks, escaping the saturation that normally appears at great depth? I didn't get the idea from the paper. Is the summation a reminder of what happened a few layers earlier, a kind of reference point? Or is it just clever regularization?

  2. How was the number of layers in the right branch chosen?

  3. Why do we train a scale layer on the right branch, according to this Caffe architecture?

Best Answer

In short (from my cellphone): it works because the gradient reaches every layer through a path that contains only a small number of layers to differentiate through.

If you pick a layer near the bottom of the stack, it has a connection to the output layer that passes through only a couple of other layers. This means the gradient that reaches it is far less attenuated and distorted.

It is a way to mitigate the vanishing gradient problem, and that is why models can be built even deeper.
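To make that concrete, here is a 1-D caricature (my own toy example, assuming each layer's local derivative dF/dx is a constant w). In a plain stack the end-to-end derivative is the product of the layer derivatives and shrinks geometrically; in a residual stack each block computes y = x + F(x), so each factor is 1 + w and the identity term always keeps a direct gradient path back to the early layers:

    L, w = 50, 0.1                 # 50 layers; w stands in for each layer's dF/dx

    plain_grad    = w ** L         # plain stack: product of small factors, about 1e-50
    residual_grad = (1 + w) ** L   # residual stack: each factor is 1 + dF/dx, about 117

    print(plain_grad, residual_grad)

The "+1" in each factor comes from the identity shortcut: the Jacobian of a residual block is I + dF/dx, so the gradient is carried past the block even when dF/dx is tiny.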