Shortcut connections in ResNet with different spatial sizes

deep-learning, residual-networks, tensorflow

If I take Fig. 3 of the paper "Deep residual learning for image recognition" and look at the following piece of the residual network:

$3\times3$ conv, 64 filters

   | (X, shape 14*14*64)
   v

$3\times3$ conv, 128 filters, stride=2

   | (X', shape 7*7*128)
   v

$3\times3$ conv, 128 filters

   |
   v (F(X), shape 7*7*128)

I must then compute the element-wise sum $X + F(X)$, but the two tensors have different shapes. A $1\times1$ convolution can match the depth (number of feature maps), but what is the standard way to make the width/height of $X$ and $F(X)$ match?

Should I compute MaxPooling(X, stride=2) + F(X)?

Best Answer

I happened to read this paper recently. It introduces a projection shortcut to match the dimensions: see Eq. 2 in the paper, $y = \mathcal{F}(x, \{W_i\}) + W_s x$, where $W_s$ is a linear projection applied on the shortcut. Three options for handling the dimension increase, A (identity with extra zero-padding), B (projection shortcuts where dimensions change), and C (all shortcuts are projections), are compared on page 6.
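For intuition, here is a minimal numpy sketch of options A and B using the shapes from your example. The array contents are random stand-ins, not a trained network; the key point is that a $1\times1$ convolution with stride 2 is just stride-2 spatial subsampling followed by a per-pixel channel projection $W_s$:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((14, 14, 64))    # block input (14*14*64, as in the question)
F_X = rng.standard_normal((7, 7, 128))   # residual branch output (dummy values)

# Option A: identity shortcut -- subsample spatially with stride 2,
# then zero-pad the extra 64 channels.
X_sub = X[::2, ::2, :]                                           # (7, 7, 64)
X_padA = np.concatenate([X_sub, np.zeros((7, 7, 64))], axis=-1)  # (7, 7, 128)

# Option B: projection shortcut -- the 1x1 conv with stride 2 is the same
# stride-2 subsampling followed by a pointwise linear map W_s (64 -> 128).
W_s = rng.standard_normal((64, 128)) * 0.01
X_proj = X_sub @ W_s                                             # (7, 7, 128)

Y = X_proj + F_X          # element-wise sum now works
print(Y.shape)            # (7, 7, 128)
```

Note that this already answers the width/height part of your question: the stride-2 step inside the shortcut halves the spatial size, so no separate max-pooling is needed.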

There is a TensorFlow implementation of residual networks; you can find the shortcut projection implemented there in real code.

Good luck!