Solved – What problem do Residual Nets solve that batch normalization does not solve

deep learning, machine learning, neural networks

I was trying to understand how the contributions of Residual Nets differ from those of batch normalization. I have read both papers, but it's still not clear to me.

As far as I can tell, batch normalization essentially solved the issue of vanishing and exploding gradients. However, intuitively it seems to me that, in principle, this problem was mainly caused by the depth of the network. Why, then, is batch normalization not able to train networks that are as deep as residual networks? What is special about ResNets that batch norm does not do?
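To make concrete what I mean by batch normalization, here is a rough sketch of the training-time computation; PyTorch and the tensor shapes are just my choice for illustration, not anything from the papers:

```python
import torch

# Minimal sketch of what batch normalization computes at training time
# (PyTorch used only for illustration).
def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature over the mini-batch,
    # then rescale/shift with the learnable parameters gamma and beta.
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = torch.randn(32, 64)                  # mini-batch of 32 examples, 64 features
gamma, beta = torch.ones(64), torch.zeros(64)
y = batch_norm(x, gamma, beta)
print(y.mean(dim=0).abs().max())         # ~0: each feature is re-centered
print(y.std(dim=0).mean())               # ~1: each feature is re-scaled
```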

How do the contributions of ResNets and batch norm differ? Is it that skip connections can learn the identity better, and thus much deeper nets become trainable? Do ResNets not work without batch norm, meaning that ResNets do not solve the vanishing/exploding gradient problem at all?

How do the contributions differ?
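To illustrate what I mean by a skip connection, here is a rough sketch of a residual block in PyTorch; the layer arrangement and sizes are my own illustrative choices, not taken from either paper. If the residual branch learns to output roughly zero, the block reduces to the identity mapping:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of y = relu(x + F(x)). If the residual branch F outputs ~0,
    the block passes x through unchanged, so stacking such blocks cannot
    make the representation worse than the shallower network."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.branch(x))  # the skip connection adds the input back

x = torch.randn(8, 64, 32, 32)
print(ResidualBlock()(x).shape)  # torch.Size([8, 64, 32, 32])
```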

Best Answer

"Skip connections eliminate singularities" by A. Emin Orhan, Xaq Pitkow offers an explanation: residual connections ameliorate singularities in neural networks.

Skip connections made the training of very deep networks possible and have become an indispensable component in a variety of neural architectures. A completely satisfactory explanation for their success remains elusive. Here, we present a novel explanation for the benefits of skip connections in training very deep networks. The difficulty of training deep networks is partly due to the singularities caused by the non-identifiability of the model. Several such singularities have been identified in previous works: (i) overlap singularities caused by the permutation symmetry of nodes in a given layer, (ii) elimination singularities corresponding to the elimination, i.e. consistent deactivation, of nodes, (iii) singularities generated by the linear dependence of the nodes. These singularities cause degenerate manifolds in the loss landscape that slow down learning. We argue that skip connections eliminate these singularities by breaking the permutation symmetry of nodes, by reducing the possibility of node elimination and by making the nodes less linearly dependent. Moreover, for typical initializations, skip connections move the network away from the “ghosts” of these singularities and sculpt the landscape around them to alleviate the learning slow-down. These hypotheses are supported by evidence from simplified models, as well as from experiments with deep networks trained on real-world datasets.

I'm not aware of a layer or batch normalization strategy that can accomplish this.
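As a toy illustration of the elimination point (my own sketch, not code from the paper): if a layer's weights collapse to zero, a plain block passes neither signal nor gradient, while a residual block still passes both through the identity path.

```python
import torch
import torch.nn as nn

# Toy illustration (mine, not from Orhan & Pitkow): zeroed weights simulate
# "eliminated" units. The plain block then blocks both signal and gradient,
# while the residual block keeps both flowing through the identity path.
lin = nn.Linear(16, 16)
nn.init.zeros_(lin.weight)
nn.init.zeros_(lin.bias)

x = torch.randn(4, 16, requires_grad=True)

plain = torch.relu(lin(x))        # plain block: output is identically zero
resid = torch.relu(x + lin(x))    # residual block: identity path survives

plain.sum().backward()
grad_plain = x.grad.clone()
x.grad = None
resid.sum().backward()
grad_resid = x.grad.clone()

print(plain.abs().max().item())           # 0.0 -> no signal through the plain block
print(grad_plain.abs().max().item())      # 0.0 -> no gradient reaches earlier layers
print(grad_resid.abs().sum().item() > 0)  # True -> gradient still flows via the skip
```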
