Solved – What are the effects of depth and width in deep neural networks

deep learning, neural networks

How do depth and width in neural networks affect the performance of the network?

For example, He et al. introduced very deep residual networks and claimed "We obtain [compelling accuracy] via a simple but essential concept—going deeper."
On the other hand, Zagoruyko and Komodakis argue that wide residual networks "are far superior over their commonly used thin and very deep counterparts."

Can someone summarise the current (theoretical) understanding in deep learning about the effects of width and depth in deep neural networks?

Best Answer

The "Wide Residual Networks" paper linked makes a nice summary at the bottom of p8:

  • Widening consistently improves performance across residual networks of different depth;
  • Increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed;
  • There doesn’t seem to be a regularization effect from very high depth in residual networks as wide networks with the same number of parameters as thin ones can learn same or better representations. Furthermore, wide networks can successfully learn with a 2 or more times larger number of parameters than thin ones, which would require doubling the depth of thin networks, making them infeasibly expensive to train.
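
To put the parameter-count comparison in the last bullet into concrete terms: widening a block by a factor k multiplies its parameter count by roughly k², so a single wide block can hold as many parameters as many stacked thin ones. Here is a minimal PyTorch sketch of my own (not code from the paper; the block below is a simplified stand-in for the blocks it compares):

```python
import torch.nn as nn

def block(channels):
    # Body of a simplified "basic" residual block: two 3x3 convolutions.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    )

def n_params(module):
    return sum(p.numel() for p in module.parameters())

thin = block(16)       # base width used by the original CIFAR ResNets
wide = block(16 * 4)   # the same block widened by k = 4

print(n_params(thin), n_params(wide))
# Widening by k multiplies a block's parameters by roughly k**2, so one wide
# block holds about as many parameters as ~k**2 thin blocks stacked in depth.
```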

The paper focuses on an experimental comparison between the two approaches. Nonetheless, I believe that theoretically (and the paper also states this) one of the main reasons why wide residual networks produce faster and more accurate results than previous work is that:

it is more computationally effective to widen the layers than have thousands of small kernels as GPU is much more efficient in parallel computations on large tensors.

That is, wider residual networks allow many multiplications to be computed in parallel, whilst deeper residual networks require more sequential computation (since each layer's computation depends on the output of the previous one).
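
As a rough sketch of that parallel-versus-sequential distinction (my own illustration, using fully connected layers instead of convolutions, with nonlinearities omitted for brevity):

```python
import torch
import torch.nn as nn

d, k, depth = 256, 4, 4
x = torch.randn(128, d)

# "Wide": one large layer, i.e. a single big matrix multiplication that a GPU
# can parallelise across all of its units at once.
wide = nn.Linear(d, k * d)

# "Deep": a stack of small layers, where each matrix multiplication has to
# wait for the output of the previous layer, so the work is inherently serial.
deep = nn.Sequential(*[nn.Linear(d, d) for _ in range(depth)])

# With depth == k, both do roughly the same number of multiply-adds per input,
# but the wide version is one large operation and the deep version is `depth`
# dependent operations executed one after another.
y_wide = wide(x)
y_deep = deep(x)
```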

Also, regarding the third bullet point above, the paper notes:

the residual block with identity mapping that allows to train very deep networks is at the same time a weakness of residual networks. As gradient flows through the network there is nothing to force it to go through residual block weights and it can avoid learning anything during training, so it is possible that there is either only a few blocks that learn useful representations, or many blocks share very little information with small contribution to the final goal.
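
The "nothing forces the gradient through the block weights" point follows from the residual form y = x + F(x): the identity path delivers gradient to earlier layers regardless of what F contributes. A small sketch of my own (not the paper's code), with the residual branch initialised so that it contributes nothing:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = x + F(x): the identity mapping lets both activations and gradients
    # bypass the residual branch F entirely.
    def __init__(self, d):
        super().__init__()
        self.f = nn.Linear(d, d)
        nn.init.zeros_(self.f.weight)  # a branch that currently contributes nothing
        nn.init.zeros_(self.f.bias)

    def forward(self, x):
        return x + self.f(x)

x = torch.randn(8, 32, requires_grad=True)
ResidualBlock(32)(x).sum().backward()

print(torch.allclose(x.grad, torch.ones_like(x)))
# True: the gradient reaches x at full strength through the identity path,
# whether or not the residual branch F learns anything useful.
```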

There are also some useful comments on the Reddit thread about this paper.
