How do the depth and width of a neural network affect its performance?
For example, He et al. introduced very deep residual networks and claimed “We obtain [compelling accuracy] via a simple but essential concept—going deeper.”
On the other hand, Zagoruyko and Komodakis argue that wide residual networks “are far superior over their commonly used thin and very deep counterparts.”
Can someone summarise the current (theoretical) understanding of the effects of width and depth in deep neural networks?
Best Answer
The "Wide Residual Networks" paper linked makes a nice summary at the bottom of p8:
The paper focuses on an experimental comparison between the two approaches. Nonetheless, I believe that theoretically (and the paper also states this) one of the main reasons wide residual networks produce faster and more accurate results than earlier, very deep ones is computational: wider residual networks allow many multiplications to be computed in parallel, whereas deeper residual networks force more sequential computation, since each layer's computation depends on the output of the previous layer.
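As a rough illustration of that point (my own NumPy sketch on CPU, not something from the paper), the following compares one "wide" layer against a parameter- and FLOP-matched chain of "thin" layers; the wide version is a single large matrix multiplication the hardware can parallelise, while the thin chain must run its small multiplications one after another:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
batch = 128

# Wide: one 2048x2048 layer (~4.2M weights, applied in one large matmul).
W_wide = rng.standard_normal((2048, 2048)).astype(np.float32)
x_wide = rng.standard_normal((batch, 2048)).astype(np.float32)

# Thin and deep: 64 sequential 256x256 layers (also ~4.2M weights in total).
W_thin = [rng.standard_normal((256, 256)).astype(np.float32) for _ in range(64)]
x_thin = rng.standard_normal((batch, 256)).astype(np.float32)

def time_it(fn, reps=20):
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

def wide_forward():
    # One big matmul + ReLU: parallel-friendly.
    return np.maximum(x_wide @ W_wide, 0.0)

def thin_forward():
    # A chain of small matmuls: each step depends on the previous one.
    h = x_thin
    for W in W_thin:
        h = np.maximum(h @ W, 0.0)
    return h

print(f"wide : {time_it(wide_forward) * 1e3:.2f} ms per forward pass")
print(f"thin : {time_it(thin_forward) * 1e3:.2f} ms per forward pass")
```

The exact timings depend on the machine and the BLAS library (and the gap is typically larger on a GPU), but the wide forward pass usually makes noticeably better use of the available parallelism even though both versions perform the same number of multiply-adds.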
Related to this, the paper also argues that widening gives a substantial speed-up in practice: wide networks of comparable or better accuracy train several times faster than very deep, thin ones.
There are also some useful comments on the Reddit thread discussing this paper.