If I take Fig.3 of the paper "Deep residual learning for image recognition", and look at the following piece of the residual network:
$3\times3$ conv, 64 filters
|  (X, shape $14\times14\times64$)
v
$3\times3$ conv, 128 filters, stride=2
|  (X', shape $7\times7\times128$)
v
$3\times3$ conv, 128 filters
|
v  (F(X), shape $7\times7\times128$)
I thus must sum (element-wise) the result $X + F(X)$ which are of different shapes. However, while a $1\times1$ convolution can help to get the same depth (number of features), what is the traditional way to obtain the same width/height for both X and F(X)?
Should I compute MaxPooling(X, stride=2) + F(X)?
Best Answer
I happened to read this paper recently. It introduces a shortcut projection to match the dimensions; see equation 2 in the paper. Three different projection options (A, B, and C) are compared on page 6.
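For reference, equation 2 of the paper writes the building block with a projection on the shortcut (here $W_s$ is the linear projection applied to $x$ when the dimensions of $x$ and $F(x)$ differ, typically realized as a $1\times1$ convolution with stride 2):

$$y = F(x, \{W_i\}) + W_s\, x$$

With identity shortcuts (equation 1), $W_s$ is dropped and $y = F(x, \{W_i\}) + x$, which is only possible when the shapes already match.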
There are TensorFlow implementations of residual networks available; you can find the shortcut projection implemented in real code there.
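To make the mechanism concrete, here is a minimal NumPy sketch of the projection shortcut (option B style): a $1\times1$ convolution with stride 2 both halves the spatial size and changes the depth. Since a $1\times1$ convolution is just a per-pixel matrix multiply over the channel axis, striding plus a matmul is enough. The shapes below match the question; the kernel `w` is random here, whereas in a real network it is learned.

```python
import numpy as np

def projection_shortcut(x, w):
    """1x1 convolution with stride 2 on the shortcut path.

    x: input feature map of shape (H, W, C_in)
    w: 1x1 kernel of shape (C_in, C_out)
    Returns a map of shape (H//2 rounded up, W//2 rounded up, C_out).
    """
    x_strided = x[::2, ::2, :]   # stride-2 spatial subsampling
    return x_strided @ w         # 1x1 conv = channel projection per pixel

# Shapes from the question: X is 14x14x64, F(X) is 7x7x128.
x = np.random.randn(14, 14, 64)
w = np.random.randn(64, 128) * 0.1   # stand-in for a learned 1x1 kernel
fx = np.random.randn(7, 7, 128)      # stand-in for the residual branch F(X)

y = projection_shortcut(x, w) + fx   # shapes now agree: 7x7x128
print(y.shape)
```

So instead of max pooling, the paper's default when dimensions change is this strided $1\times1$ projection (option B); option A in the paper instead uses an identity shortcut with stride-2 subsampling and zero-padding of the extra channels, which adds no parameters.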
Good luck!