Solved – Residual network: the identity function in dimension-changing blocks

conv-neural-network, deep-learning, machine-learning, residual-networks, residuals

In trying to implement ResNet with bottleneck blocks myself, I got very confused about the identity function in residual blocks whose dimensions change. The paper compares three options: identity shortcuts everywhere, conv projections only on dimension-changing blocks, and conv projections on all blocks. I decided to go with identity, as the other options don't increase accuracy substantially but do increase training memory and parameter counts. I then noticed the 'identity' is actually an identity with stride 2. (To my understanding, this is essentially max pooling with a kernel of (1,1) and stride (2,2), then concatenating a bunch of zero channels.) Meaning you lose 3/4 of the identity and end up with a new 'identity' tensor of size

(num_filters * 2, n/2, n/2)

with the back half of the channels full of zeros, where n is the spatial ('height' and 'width') dimension and num_filters is the number of filters from the previous layer.
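
For concreteness, here is a minimal sketch of that strided, zero-padded shortcut in PyTorch (the function name and the (N, C, H, W) layout are just my assumptions for illustration):

    import torch
    import torch.nn.functional as F

    def zero_pad_identity(x: torch.Tensor, out_channels: int) -> torch.Tensor:
        # Keep every second spatial position, i.e. a (1,1)-kernel pool with stride (2,2)
        x = x[:, :, ::2, ::2]
        # Append zero channels until the channel count matches out_channels
        return F.pad(x, (0, 0, 0, 0, 0, out_channels - x.shape[1]))

    x = torch.randn(1, 256, 56, 56)
    print(zero_pad_identity(x, 512).shape)  # torch.Size([1, 512, 28, 28])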

E.g. the first dimension-changing bottleneck block (omitting batch norm and activations; a quick shape check in code follows the listing):

Input: (256, 56, 56)
Conv (kernel 1x1, 128 filters, stride 1) -> (128, 56, 56)
Conv (kernel 3x3, 128 filters, stride 2, pad 1) -> (128, 28, 28)
Conv (kernel 1x1, 512 filters, stride 1) -> (512, 28, 28)
Sum (last_conv, confusing_identity)

Hence confusing_identity must be of size (512, 28, 28).
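
A minimal PyTorch sketch of that conv stack, just to confirm the shapes (batch norm and ReLU omitted, as above):

    import torch
    import torch.nn as nn

    # The three convolutions listed above; comments give the output shape per layer
    block = nn.Sequential(
        nn.Conv2d(256, 128, kernel_size=1, stride=1),             # (128, 56, 56)
        nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1),  # (128, 28, 28)
        nn.Conv2d(128, 512, kernel_size=1, stride=1),             # (512, 28, 28)
    )

    x = torch.randn(1, 256, 56, 56)
    print(block(x).shape)  # torch.Size([1, 512, 28, 28])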

Wouldn't a max pooling with kernel (3,3), padding (1,1), and stride (2,2) encode more information than this lossy identity?
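
A quick shape check of that proposed pooling (sketched in PyTorch):

    import torch
    import torch.nn as nn

    # 3x3 max pool, stride 2, pad 1: every input position contributes and the spatial
    # size still halves; the channels would still need zero padding (256 -> 512) before the sum
    pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 256, 56, 56)
    print(pool(x).shape)  # torch.Size([1, 256, 28, 28])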

Also, does either choice really affect training time substantially? (Since both implementations have 0 parameters.)

Best Answer

Instead of max pooling you can use 2D average pooling with stride (2,2) and kernel size (2,2), optionally concatenating the result with itself along the channel axis to get 512 features instead of 256.
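
A quick sketch of that shortcut in PyTorch (the concatenation simply duplicates the pooled channels):

    import torch
    import torch.nn as nn

    # 2x2 average pool with stride 2, then duplicate the channels: 256 -> 512
    pool = nn.AvgPool2d(kernel_size=2, stride=2)
    x = torch.randn(1, 256, 56, 56)
    pooled = pool(x)                               # (1, 256, 28, 28)
    shortcut = torch.cat([pooled, pooled], dim=1)  # (1, 512, 28, 28)
    print(shortcut.shape)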

The benefit is that average pooling is a linear operation, so it does not hinder gradient propagation, and unlike the strided identity it does not simply discard activations: every input position contributes to the pooled output.

The problem with the 1x1 strided 'max pooling' (the strided identity) is, as you said, that 3/4 of the information gets discarded. The problem with your proposed 3x3 max pooling is that it is still non-linear, which runs counter to the aim of residual networks: learning a residual function on top of an identity (linear) mapping.
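
A small sanity check of the linearity argument (a sketch; the shapes and pooling windows are arbitrary):

    import torch
    import torch.nn as nn

    # Average pooling commutes with addition; max pooling does not (in general)
    avg = nn.AvgPool2d(kernel_size=2, stride=2)
    mx = nn.MaxPool2d(kernel_size=2, stride=2)
    a, b = torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8)
    print(torch.allclose(avg(a + b), avg(a) + avg(b)))  # True
    print(torch.allclose(mx(a + b), mx(a) + mx(b)))     # almost surely False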