Convolutional Neural Networks – Why Build the Sum of a Filter Over Multiple Channels?

conv-neural-network, convolution

Let's say I have RGB input data (3 channels) and a convolutional layer that has just one filter with a depth of 3. The output will have a depth of 1 if we sum the results of the per-channel convolutions. But why sum the results at all? Why not take the average instead, or, say, always add 17?

Some thoughts:
It seems like we might lose information due to the summation. For example, if there is a positive edge on the red channel but a negative edge on the blue channel, they will cancel each other out. Okay, the weights can be different for each channel, which might help, but I still don't see the advantage of a summation over other operations.

R (1. channel) conv Filter 1 [x:x:1] \
                                      \
G (2. channel) conv Filter 1 [x:x:2]    => Sum => output [x:x:1] WHY?
                                      /
B (3. channel) conv Filter 1 [x:x:3] /
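
To make the diagram concrete, here is a minimal NumPy sketch of the operation I am asking about (the shapes and random values are made up purely for illustration): each channel is convolved with its own slice of the single filter, and the three per-channel results are summed into one output map of depth 1.

import numpy as np

def conv2d_single_channel(x, k):
    # Valid cross-correlation of one 2-D channel x with one 2-D kernel k.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rgb  = np.random.randn(3, 5, 5)   # input: 3 channels (R, G, B), each 5x5
filt = np.random.randn(3, 3, 3)   # one filter of depth 3: one 3x3 kernel per channel

# R conv filt[0], G conv filt[1], B conv filt[2], then sum => a single output channel
per_channel = [conv2d_single_channel(rgb[c], filt[c]) for c in range(3)]
output = sum(per_channel)         # shape (3, 3), depth 1 -- the "WHY?" step above
print(output.shape)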

EDIT:
Here is a much better graphic (scroll down to the gif).
http://cs231n.github.io/convolutional-networks/#conv

Best Answer

Okay, the weights can be different for each channel, which might help, but I still don't see the advantage of a summation over other operations.

Exactly. You are missing the fact that the weights are learnable. Of course, initially it is possible that the edges from different channels cancel each other out and the output tensor loses that information. But this would produce a large loss value, i.e., large backpropagation gradients, which would tweak the weights accordingly. In practice, the network learns to capture the edges (or corners, or more complex patterns) in any channel. When the filter does not match the patch, the convolution result is very close to zero rather than a large negative number, so nothing is lost in the sum. (In fact, after some training, most of the values in the kernels are close to zero.)
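
To illustrate the point about per-channel weights, here is a small self-contained sketch (the edge images and kernels are invented for illustration). A positive vertical edge in the red channel and the same edge negated in the blue channel cancel in the sum only if both channels are forced to use the same kernel; once each channel has its own kernel, as the learned weights would, the responses reinforce each other instead.

import numpy as np
from scipy.signal import correlate2d

red  = np.tile([0., 0., 1., 1., 1.], (5, 1))   # positive vertical edge
blue = -red                                    # the same edge, negated
k    = np.array([[1., 0., -1.]] * 3)           # a vertical edge detector

# Same kernel on both channels: the summed response vanishes.
shared  = correlate2d(red, k, mode='valid') + correlate2d(blue, k, mode='valid')
# Different kernel per channel (here simply k and -k): the responses add up.
learned = correlate2d(red, k, mode='valid') + correlate2d(blue, -k, mode='valid')

print(np.abs(shared).max())    # 0.0 -- the edges cancel in the sum
print(np.abs(learned).max())   # 6.0 -- per-channel weights preserve the signal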

The reason to do a summation is that it is efficient (both the forward and backward operations are vectorizable) and lets gradients flow nicely. You can view the convolution as a sophisticated linear layer with shared weights. If the concern you raise were real, you would see the same problem in every linear layer of any network: when different features are summed up with some weights, they could cancel each other out, right? Luckily, this does not happen (unless the features are correlated, e.g., specially crafted), for the reasons described above, which is why the linear operation is the crucial building block of any neural network.
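
As a quick sanity check of the "linear layer with shared weights" view (the shapes below are arbitrary and chosen just for illustration): at each output location, summing the per-channel convolution results is exactly a dot product between the flattened input patch and the flattened filter, i.e., one unit of a linear layer applied at every spatial position.

import numpy as np

rng   = np.random.default_rng(0)
patch = rng.standard_normal((3, 3, 3))   # one 3x3 RGB patch: (channels, height, width)
filt  = rng.standard_normal((3, 3, 3))   # one filter of the same shape

# Summing the per-channel elementwise products over the channels ...
summed_over_channels = sum(np.sum(patch[c] * filt[c]) for c in range(3))
# ... gives the same number as a single dot product of the flattened vectors.
as_linear_unit = patch.ravel() @ filt.ravel()

print(np.isclose(summed_over_channels, as_linear_unit))   # True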