Solved – How are filters and activation maps connected in Convolutional Neural Networks

conv-neural-network, deep-learning, machine-learning

How are the activation maps at a given layer connected to the filters for that layer?
I'm not asking about how to do a convolutional operation between the filter and the activation map, I am asking about the type of connectivity these two have.

For example, say you wanted to do full connectivity. You have f filters and n activation maps in a given layer. You would get f * n activation maps in the next layer, and the number of activation maps would increase with each new layer. This is the way I assume it is done.

Or you could say each filter is connected to just one activation map. In this case, the number of filters would equal the number of activation maps, and every layer would have the same number of filters and activation maps. This is the current architecture of my network, and it seems to learn fine.
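To make the two schemes concrete, here is a rough counting sketch (the function names are just illustrative; no convolutions are actually performed):

```python
# Illustrative sketch contrasting the two connectivity schemes above.
# These functions only count feature maps.

def maps_after_full_connectivity(n_maps: int, n_filters: int) -> int:
    """Every filter is convolved with every map, so counts multiply."""
    return n_maps * n_filters

def maps_after_one_to_one(n_maps: int) -> int:
    """Each filter is tied to exactly one map, so the count is unchanged."""
    return n_maps

print(maps_after_full_connectivity(4, 2))  # 8 maps in the next layer
print(maps_after_one_to_one(6))            # stays at 6
```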

The main source of my confusion is looking at diagrams of convnets I see online. Some of them have the "full connection" between filters and activation maps, such as this –
[diagram: each filter convolved with each activation map]
In the first layer you have 4 activation maps, and presumably 2 filters. Each map is convolved with each filter, resulting in 8 maps in the next layer. Looks great.

But here we have an architecture that doesn't make sense to me –
[diagram: CNN architecture going from 6 feature maps to 16]
How do you go from 6 maps in the first layer to 16 in the 2nd? I can think of ways to get 16 maps from 6, but they wouldn't make any sense to do.

Best Answer

The second convolutional neural network (CNN) architecture you posted comes from this paper. In the paper the authors give a description of what happens between layers S2 and C3. Their explanation is not very clear though. I'd say that this CNN architecture is not 'standard', and it can be quite confusing as a first example for CNNs.

[figure: CNN architecture from the paper]

First of all, a clarification is needed on how feature maps are produced and what their relationship with filters is. A feature map is the result of convolving a filter with a previous feature map (or, for the first layer, the input image). Let's take the layers INPUT and C1 as an example. In the most common case, to get 6 feature maps of size $28 \times 28$ in layer C1 you need 6 filters of size $5 \times 5$ (the result of a 'valid' convolution of an image of size $M \times M$ with a filter of size $N \times N$, assuming $M \geq N$, has size $(M-N+1) \times (M-N+1)$). You could, however, produce 6 feature maps by combining the outputs of fewer or more than 6 filters (e.g. by summing them up). Nothing of the sort is implied in the paper for layer C1, though.
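As a quick sanity check of the size formula, here is a minimal NumPy/SciPy sketch (the random image and kernel are placeholders):

```python
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(32, 32)   # INPUT is 32x32 in this architecture
kernel = np.random.rand(5, 5)    # one 5x5 filter

feature_map = convolve2d(image, kernel, mode='valid')
print(feature_map.shape)         # (28, 28) = (32-5+1, 32-5+1)
```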

What happens between layer S2 and layer C3 is the following. The 16 feature maps in layer C3 are produced from the 6 feature maps in layer S2. The number of filters in layer C3 is indeed not obvious. In fact, from the architecture diagram alone one cannot determine the exact number of filters that produce those 16 feature maps. The authors of the paper provide the following table (page 8):

[Table I: connections between layers S2 and C3]

With the table they provide the following explanation (bottom of page 7):

Layer C3 is a convolutional layer with 16 feature maps. Each unit in each feature map is connected to several $5 \times 5$ neighborhoods at identical locations in a subset of S2's feature maps.

In the table the authors show that every feature map in layer C3 is produced by combining 3 or more feature maps (page 8):

The first six C3 feature maps take inputs from every contiguous subsets of three feature maps in S2. The next six take input from every contiguous subset of four. The next three take input from some discontinuous subsets of four. Finally, the last one takes input from all S2 feature maps.
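To make the pattern concrete, here is a rough sketch that reconstructs the subsets as I read Table I. The wrap-around of the contiguous subsets and the three 'discontinuous' subsets are my reading of the table, so treat the exact sets as an assumption:

```python
# Sketch of Table I's connectivity pattern as Python sets.
# The contiguous subsets wrap around modulo 6; the three 'discontinuous'
# subsets are copied directly from the table as I read it.
S2 = 6  # number of S2 feature maps

c3_inputs = []
c3_inputs += [{(i + j) % S2 for j in range(3)} for i in range(6)]  # maps 0-5
c3_inputs += [{(i + j) % S2 for j in range(4)} for i in range(6)]  # maps 6-11
c3_inputs += [{0, 1, 3, 4}, {1, 2, 4, 5}, {0, 2, 3, 5}]            # maps 12-14
c3_inputs += [set(range(S2))]                                      # map 15

for m, inputs in enumerate(c3_inputs):
    print(f"C3 map {m:2d} <- S2 maps {sorted(inputs)}")
```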

Now, how many filters are there in layer C3? Unfortunately, they do not explain this. The two simplest possibilities would be:

  1. There is one filter per S2 feature map per C3 feature map, i.e. there is no filter sharing between S2 feature maps associated with the same C3 feature map.
  2. There is one filter per C3 feature map, which is shared across the (3 or more) feature maps of layer S2 that are combined.

In both cases, to 'combine' means that the results of the per-map convolutions within an S2 feature-map group need to be combined to produce the C3 feature maps. The authors do not specify how this is done, but addition is a common choice (see for example the animated gif near the middle of this page).
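Here is a minimal sketch of case (1) under the addition assumption; the shapes and names are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.signal import convolve2d

# One 5x5 filter per (S2 map, C3 map) pair; the per-input convolutions
# are summed and a single bias is added per C3 feature map.
s2_maps = [np.random.rand(14, 14) for _ in range(3)]   # a subset of 3 S2 maps
filters = [np.random.rand(5, 5) for _ in s2_maps]      # one filter per input map
bias = 0.1

c3_map = sum(convolve2d(m, k, mode='valid') for m, k in zip(s2_maps, filters))
c3_map += bias
print(c3_map.shape)  # (10, 10), matching C3's feature-map size
```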

The authors give some additional information though, which can help us decipher the architecture. They say that 'layer C3 has 1,516 trainable parameters' (page 8). We can use this information to decide between cases (1) and (2) above.

In case (1) we have $(6 \times 3) + (9 \times 4) + (1 \times 6) = 60$ filters. The filter size is $(14-10+1) \times (14-10+1) = 5 \times 5$, since a 'valid' convolution must reduce the $14 \times 14$ S2 maps to the $10 \times 10$ C3 maps. The number of trainable weights in this case would be $5 \times 5 \times 60 = 1{,}500$. If we add one bias unit per C3 feature map, we get $1{,}500 + 16 = 1{,}516$ parameters, which is exactly what the authors report. For completeness, in case (2) we would have $(5 \times 5 \times 16) + 16 = 416$ parameters, which does not match.
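A quick way to double-check this arithmetic, with the subset sizes taken from Table I:

```python
# Six subsets of 3, nine subsets of 4 (six contiguous + three
# discontinuous), and one subset of 6, per Table I.
subset_sizes = [3] * 6 + [4] * 9 + [6]
n_filters = sum(subset_sizes)                  # 60 filters
n_params = 25 * n_filters + len(subset_sizes)  # 5x5 weights each + 16 biases
print(n_filters, n_params)                     # 60 1516
```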

Therefore, if we look again at Table I above, there are 10 distinct C3 filters associated with each S2 feature map (thus 60 distinct filters in total).

The authors explain the reasoning behind this choice:

Different feature maps [in layer C3] are forced to extract different (hopefully complementary) features because they get different sets of inputs.

I hope this clarifies the situation.