I have a sequence of images of shape $(40,64,64,12)$. If I apply conv3d with 8 kernels having spatial extent $(3,3,3)$ without padding, how to calculate the shape of output.
If the next layer is max pooling with $(2,2,2)$, what will be the output shape?
Best Answer
The convolution formula is the same as in 2D and is well-described in CS231n tutorial:
$$Out = (W−F+2P)/S+1$$
... where $W$ is the input volume size, $F$ is the receptive field size, $S$ is the stride, and $P$ is the amount of zero padding used on the border. In particular, when $S=1$ and $P=0$, like in your question, it simplifies to
$$Out=W-F+1$$
So, if you input the tensor $(40,64,64,12)$, ignoring the batch size, and $F=3$, then the output tensor size will be $(38, 62, 62, 8)$.
Pooling layer normally halves each spatial dimension. This corresponds to the local receptive field size
F=(2, 2, 2)
and strideS=(2, 2, 2)
. Hence, the input tensor $(38, 62, 62, 8)$ will be transformed to $(19, 31, 31, 8)$.But you set the stride
S=(1, 1, 1)
, it'll reduce each spatial dimension by 1: $(37, 61, 61, 8)$.The last dimension doesn't change.