Solved – How to calculate output shape in 3D convolution

conv-neural-networkmachine learningneural networks

I have a sequence of images of shape $(40,64,64,12)$. If I apply conv3d with 8 kernels having spatial extent $(3,3,3)$ without padding, how to calculate the shape of output.

If the next layer is max pooling with $(2,2,2)$, what will be the output shape?

Best Answer

The convolution formula is the same as in 2D and is well-described in CS231n tutorial:

$$Out = (W−F+2P)/S+1$$

... where $W$ is the input volume size, $F$ is the receptive field size, $S$ is the stride, and $P$ is the amount of zero padding used on the border. In particular, when $S=1$ and $P=0$, like in your question, it simplifies to

$$Out=W-F+1$$

So, if you input the tensor $(40,64,64,12)$, ignoring the batch size, and $F=3$, then the output tensor size will be $(38, 62, 62, 8)$.


Pooling layer normally halves each spatial dimension. This corresponds to the local receptive field size F=(2, 2, 2) and stride S=(2, 2, 2). Hence, the input tensor $(38, 62, 62, 8)$ will be transformed to $(19, 31, 31, 8)$.

But you set the stride S=(1, 1, 1), it'll reduce each spatial dimension by 1: $(37, 61, 61, 8)$.

The last dimension doesn't change.

Related Question