Solved – 2D convolution with depth

conv-neural-network, convolution, machine learning, neural networks

Let's say I have a convolutional neural network whose input images have dimensions 25x25x3 (3 depth channels for colour), and I pass them through a convolution layer of 5 kernels, each 3×3.

The depth of each kernel is always the same as the input depth, so my kernels are actually 3x3x3.

The layer will convolve each 3x3x3 kernel over the 25x25x3 input image. Each kernel convolution will produce a 25x25x1 feature map (assuming stride 1 and zero-padding of 1, so the spatial size is preserved), and the 5 feature maps are then stacked to produce the output volume of 25x25x5.

I'm confused as to how 2D convolutions (with depth 3) produce a feature map with only depth 1.

I'm imagining separate convolutions over the spatial dimensions (3×3 over 25×25, separately for each of the 3 depth channels). How do the spatial convolutions across the 3 depth channels then get condensed down to 1 output depth channel (for each kernel)? What is the operation there? Is it simply summation, max, or something else?

Best Answer

You already understand that a single kernel has dimensions 3x3x3 and that there are 5 kernels. Each kernel is a stack of three 3x3 windows, one for each color channel (R, G, B). When the kernel is placed at a particular location over the input image, each of its 3 slices is multiplied elementwise with the corresponding channel's 3x3 patch of pixels and the products are summed (a dot product), which gives one scalar per channel. So you get 3 scalars, one for each channel. These scalars are then summed, and another scalar representing the bias of the filter is added to the sum. The end result is a single scalar for that location, which is why each kernel produces a feature map of depth 1. The operation across channels is summation, not max.
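The two steps above can be sketched for a single kernel position in NumPy. This is only an illustration: the patch, kernel, and bias values are random placeholders, and the names `patch`, `kernel`, and `bias` are hypothetical rather than taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the 3x3x3 region of the input under the kernel
# (height, width, channel), one 3x3x3 kernel, and its scalar bias.
patch = rng.standard_normal((3, 3, 3))
kernel = rng.standard_normal((3, 3, 3))
bias = 0.5

# Step 1: one scalar per channel, the dot product of the 3x3 patch
# with the matching 3x3 slice of the kernel.
per_channel = np.array(
    [np.sum(patch[:, :, c] * kernel[:, :, c]) for c in range(3)]
)

# Step 2: sum the three per-channel scalars and add the bias.
output_value = per_channel.sum() + bias

# Equivalent shortcut: a single dot product over all 27 values.
output_direct = np.sum(patch * kernel) + bias
```

Both routes give the same number, so a 3x3x3 kernel can be thought of either as three 2D dot products that are summed, or as one 27-element dot product.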

You can view an animated demo under the convolution demo section on this page. Use the Toggle Movement button to pause the animation and look at how the output is being generated.

[Screenshot: convolution operation]

In the screenshot above (taken at an arbitrary moment), the portions of the image channels to which the convolution is being applied are outlined in blue. The three components of filter W1 are outlined in red. If you take the dot product of each 3x3 blue rectangle with its component of the filter, you get a scalar. The 3 scalars obtained, one per channel, are summed up along with the bias of filter W1, shown as a single cell below the filter components. The result is a single scalar, the number 2, outlined in green in the last column.
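Repeating that per-location computation over the whole image, once per kernel, gives the output volume from the question. The sketch below is a deliberately naive loop, not how any framework implements it. It assumes stride 1 and zero-padding of 1 (needed for a 3x3 kernel to keep the 25x25 spatial size), and `conv_layer` is a hypothetical helper name.

```python
import numpy as np

def conv_layer(image, kernels, biases, pad=1):
    """Naive stride-1 convolution (cross-correlation, as used in CNNs).

    image:   (H, W, C) input volume
    kernels: (K, kh, kw, C) stack of K kernels, each as deep as the input
    biases:  (K,) one scalar bias per kernel
    """
    H, W, C = image.shape
    K, kh, kw, _ = kernels.shape
    # Zero-pad only the two spatial dimensions, never the channel axis.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    out_h = H + 2 * pad - kh + 1
    out_w = W + 2 * pad - kw + 1
    out = np.zeros((out_h, out_w, K))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                window = padded[i:i + kh, j:j + kw, :]
                # Sum over height, width AND channels, then add the bias:
                # each kernel contributes exactly one scalar per location.
                out[i, j, k] = np.sum(window * kernels[k]) + biases[k]
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((25, 25, 3))     # placeholder input image
kernels = rng.standard_normal((5, 3, 3, 3))  # 5 kernels, each 3x3x3
biases = rng.standard_normal(5)

out = conv_layer(image, kernels, biases)
print(out.shape)  # (25, 25, 5)
```

With `pad=0` the same function returns a 23x23x5 volume, which shows that the 25x25 output size in the question depends on the padding assumption.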

The text also describes how convolution is being carried out.