Solved – Connection between filters and feature maps in a CNN

conv-neural-network, deep-learning, machine-learning, neural-networks

I am learning CNN with TensorFlow and Python.

I do not understand the connection between layer $\ell$ and layer $\ell+1$. For the input image and the first layer it is easy: there is only one input, so there are as many feature maps as filters, and each filter is convolved ('multiplied') with the input image. Each resulting feature map has spatial size $\lfloor (n - f + 2p) / s \rfloor + 1$, where $n$ is the input size, $f$ the filter size, $p$ the padding, and $s$ the stride.
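For example, a quick sanity check of that formula (a minimal sketch; the sizes are just an illustration):

def conv_output_size(n, f, p=0, s=1):
    # spatial output size of a convolution: floor((n - f + 2p) / s) + 1
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3))  # 26: a 3x3 filter over a 28x28 input, no padding, stride 1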

A similar question and its answer clearly address one particular example:
How are filters and activation maps connected in Convolutional Neural Networks?

But in general I still don't understand.
When we build a CNN, for example:

model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) 
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) 
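
For context, here is a self-contained version of that snippet (the 28x28 grayscale input shape is just an assumption for illustration):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

input_shape = (28, 28, 1)  # assumed: 28x28 grayscale images
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.summary()  # output shapes: (26, 26, 32) after the first layer, (24, 24, 64) after the second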

How does TensorFlow connect the 32 feature maps of the first layer to the feature maps of the next? Intuitively I would expect that with 64 filters and 32 feature maps from the previous layer we would get 64 * 32 feature maps in the next layer (every feature map connected to every filter). But I think the code above produces only 64 feature maps.

Best Answer

Your confusion stems from the fact that channels (feature maps) are treated differently from the spatial dimensions.

Let's say you have a grayscale image as input to the first layer and 32 kernels of shape (3, 3), as in your example. In fact, those kernels have shape (3, 3, 1) - the 1 being the number of channels in the input. For an RGB input image it would be (3, 3, 3). The channel dimension is simply omitted in the code because it is inferred automatically from the number of channels of the layer's input.

The output of this layer has 32 channels (one per kernel). In the second layer of your example you have 64 kernels of shape (3, 3), but they are in fact (3, 3, 32)! Each of these kernels aggregates information from all 32 input feature maps.
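
You can check this by inspecting the kernel weights directly. A minimal sketch, rebuilding the two-layer model from the question with an assumed 28x28 grayscale input (Keras stores each kernel tensor as (height, width, in_channels, out_channels)):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    Conv2D(64, kernel_size=(3, 3), activation='relu'),
])
for layer in model.layers:
    print(layer.name, layer.kernel.shape)
# conv2d   (3, 3, 1, 32)  -> 32 kernels, each spanning the single input channel
# conv2d_1 (3, 3, 32, 64) -> 64 kernels, each spanning all 32 input channels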


Intuitively I would expect that with 64 filters and 32 feature maps from the previous layer we would get 64 * 32 feature maps in the next layer (every feature map connected to every filter).

I hope the above makes it clear that you are not applying each of the 64 kernels to each of the 32 feature maps individually. Instead, each of the 64 kernels looks at all 32 feature maps at the same time, with a different set of weights for each of them.
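
As an aside, the per-map behaviour you describe does exist: it is called a depthwise convolution, and it really does multiply the channel count. A minimal sketch of the contrast (the input shape is again just an illustration):

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, DepthwiseConv2D

x = tf.random.normal((1, 26, 26, 32))  # a batch with 32 input feature maps

# Standard convolution: 64 kernels of shape (3, 3, 32), summed over all input channels
print(Conv2D(64, kernel_size=(3, 3))(x).shape)  # (1, 24, 24, 64)

# Depthwise convolution: 64 separate (3, 3) kernels applied to each input map individually
print(DepthwiseConv2D(kernel_size=(3, 3), depth_multiplier=64)(x).shape)  # (1, 24, 24, 2048), i.e. 32 * 64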