Solved – Problem figuring out the inputs to a fully connected layer from a convolutional layer in a CNN

Tags: conv-neural-network, machine-learning, neural-networks

The question concerns the mathematical details of convolutional neural networks. Assume that the architecture of the net (whose objective is image classification) is as follows:

  • Input image 32×32
  • First hidden layer 3×28×28 (formed by convolving with 3 filters of
    size 5×5, stride length = 1 and no padding), followed by
    activation
  • Pooling layer (pooling over a 2×2 region) producing an output of
    3×14×14
  • Second hidden layer 6×10×10 (formed by convolving with 6 filters
    of size 5×5, stride length = 1 and no padding), followed by
    activation
  • Pooling layer (pooling over a 2×2 region) producing an output of
    6×5×5
  • Fully connected layer (FCN-1) with 100 neurons
  • Fully connected layer (FCN-2) with 10 neurons
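The shape bookkeeping in the list above can be checked with a quick sketch (plain Python; this assumes stride 1, valid/no-padding convolutions, and non-overlapping 2×2 pooling, which is what the stated sizes imply):

```python
def conv_out(size, filt, stride=1):
    """Spatial size after a valid (no-padding) convolution."""
    return (size - filt) // stride + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping pooling."""
    return size // pool

s = 32                 # input image 32x32
s = conv_out(s, 5)     # first conv  -> 28, i.e. feature maps 3x28x28
s = pool_out(s)        # first pool  -> 14, i.e. 3x14x14
s = conv_out(s, 5)     # second conv -> 10, i.e. 6x10x10
s = pool_out(s)        # second pool -> 5,  i.e. 6x5x5
flat = 6 * s * s       # flattened input to FCN-1
print(s, flat)         # 5 150
```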

From my readings thus far, I have understood that the six 5×5 matrices are all connected to FCN-1. I have two questions, both of which concern the way output from one layer is fed to the next.

  1. The output of the second pooling layer is 6×5×5. How is this fed to FCN-1? What I mean is that each neuron in FCN-1 can be seen as a node that takes a scalar as input (or a 1×1 matrix). So how do we feed it an input of 6×5×5? I initially thought we would flatten the 6×5×5 volume into a 150×1 array and feed that to the layer as a single 150-dimensional input. But doesn’t flattening the feature maps defeat the argument about the spatial structure of images?
  2. From the first pooling layer we get 3 feature maps of size 14×14. How are the feature maps in the second layer generated? Let’s say I look at the same region (a 5×5 area starting from the top left) across the 3 feature maps I get from the first pooling layer. Are these three 5×5 patches used as separate inputs to produce the corresponding region in the next set of feature maps? If so, then what if the three feature maps are instead the RGB channels of an input image? Would we still treat them separately?

Best Answer

  1. You are correct with the idea of flattening it into a vector with 150 values. You can actually take your 6×5×5 output from the last pooling layer and connect it in any order you want, and it will work the same, as long as you keep that order consistent across all training examples. The reason is that each unit in the FC layer takes a weighted sum of ALL outputs from your pooled layer, and the order in which you do a sum doesn't change the result. For example, (3 + 4 + 2) = (2 + 4 + 3).
    You are also correct that flattening discards the spatial relationships. This doesn't hurt, because after a few conv/pool layers the spatial relationships are already largely lost anyway, and what each activation represents becomes increasingly abstract. Part of the idea behind pooling is to make things more spatially invariant. For example, if you had an image of a dog towards the left half of the image and looked at the activation values on your final pooling layer, then compared them to those for the same image with the dog towards the right half, the activations should be similar. This is a huge part of where conv nets get their power.
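The order-invariance claim above can be demonstrated directly with NumPy (a small sketch with random data; the names `pooled`, `perm`, and `w` are illustrative, not from any particular framework):

```python
import numpy as np

rng = np.random.default_rng(0)

pooled = rng.standard_normal((6, 5, 5))  # output of the last pooling layer
w = rng.standard_normal(150)             # weights of one FC neuron

x1 = pooled.reshape(150)                 # row-major flattening
perm = rng.permutation(150)              # some other FIXED flattening order
x2 = x1[perm]                            # inputs in the alternative order
w2 = w[perm]                             # weights stored in the same order

# The neuron's pre-activation is identical either way,
# because a sum does not depend on the order of its terms.
print(np.allclose(x1 @ w, x2 @ w2))     # True
```

The only requirement is that the same ordering is used for every example, so that each weight always multiplies the same pooled activation.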

  2. The feature maps on the second layer are generated in the exact same way that you got the first 3 feature maps from your input image. You may be missing the idea that the filters on a given layer are 3D volumes, and their depth is equal to how many feature maps you got from the previous layer. So on your first convolution, if you are using an RGB image, your filters will be 5x5x3 (this way the filter looks across all 3 color channels to produce an activation value). If you used 10 filters on your first layer, then your next layer's filters will be 5x5x10.
    Each feature map is produced by convolving one 3D filter with all of the previous layer's feature maps (themselves a 3D volume when stacked together), and it comes out as a 2D map. These 2D maps are then stacked to create a new 3D volume, which is convolved with the next layer's set of filters.
