Convolutional Neural Networks – Clarifications and Insights

conv-neural-network, deep learning, machine learning, neural networks

While reading about converting fully connected layers into convolutional layers, as described at http://cs231n.github.io/convolutional-networks/#convert, I got confused by the following two comments:

It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.

A standard ConvNet should already work on an image of any size, since a convolutional filter simply slides across the image grid. So why do we need to slide the original ConvNet across many spatial positions in a larger image?

And

Evaluating the original ConvNet (with FC layers) independently across 224×224 crops of the 384×384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.

What does "strides of 32 pixels" mean here? Does it refer to the filter size? And when talking about 224×224 crops of the 384×384 image, does that mean we use a receptive field of 224×224?


Best Answer

Fully connected layers can only deal with input of a fixed size, because they require a fixed number of parameters to "fully connect" the input and output. Convolutional layers, on the other hand, just "slide" the same filters across the input, so they can deal with input of essentially arbitrary spatial size.
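To see the difference concretely, here is a minimal PyTorch sketch (the layer sizes here are invented for illustration, not taken from the notes): the convolutional layer accepts both input sizes unchanged, while the fully connected layer only matches the one flattened size it was built for.

```python
import torch
import torch.nn as nn

# A conv layer stores one small set of filters that slide over the input,
# so its parameter count is independent of the input's spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
# A fully connected layer's weight matrix is tied to one flattened length
# (here it assumes the 8 x 224 x 224 activations of a 224x224 input).
fc = nn.Linear(8 * 224 * 224, 1000)

x_small = torch.randn(1, 3, 224, 224)
x_large = torch.randn(1, 3, 384, 384)

print(conv(x_small).shape)  # torch.Size([1, 8, 224, 224])
print(conv(x_large).shape)  # torch.Size([1, 8, 384, 384]) -- same filters, larger output

scores = fc(conv(x_small).flatten(1))   # works: 8*224*224 matches fc's weights
# fc(conv(x_large).flatten(1))          # fails: 8*384*384 != 8*224*224 (shape mismatch)
```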

In the example network with fully connected layers at the end, a 224×224 image produces a 1000-d vector of class scores. If we apply the network to a larger image, it fails because the flattened input no longer matches the parameter shape of the first fully connected layer.

On the other hand, if we use a fully convolutional network and apply it to a larger image, we get 1000 "heatmaps" of class scores, one per class.
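Concretely, the CS231n notes convert the three FC layers of the example head (operating on a 512×7×7 volume) into conv layers with 7×7 and 1×1 kernels. Below is a hedged PyTorch sketch of that conversion; to keep it short, the convolutional backbone (which downsamples by 32, so 224×224 gives 512×7×7 and 384×384 gives 512×12×12) is stood in for by random feature tensors.

```python
import torch
import torch.nn as nn

# Original head: flatten + three fully connected layers.
# It only accepts the 512x7x7 volume of a 224x224 input.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

# Converted head: each FC layer becomes a conv layer
# (FC-4096 -> 7x7 conv, FC-4096 -> 1x1 conv, FC-1000 -> 1x1 conv).
conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(),
    nn.Conv2d(4096, 1000, kernel_size=1),
)

# The conversion is just a reshape of the FC weight matrices into filters.
with torch.no_grad():
    conv_head[0].weight.copy_(fc_head[1].weight.view(4096, 512, 7, 7))
    conv_head[0].bias.copy_(fc_head[1].bias)
    conv_head[2].weight.copy_(fc_head[3].weight.view(4096, 4096, 1, 1))
    conv_head[2].bias.copy_(fc_head[3].bias)
    conv_head[4].weight.copy_(fc_head[5].weight.view(1000, 4096, 1, 1))
    conv_head[4].bias.copy_(fc_head[5].bias)

feat_224 = torch.randn(1, 512, 7, 7)    # backbone features of a 224x224 image
feat_384 = torch.randn(1, 512, 12, 12)  # backbone features of a 384x384 image

print(conv_head(feat_224).shape)  # torch.Size([1, 1000, 1, 1]) -- one score per class
print(conv_head(feat_384).shape)  # torch.Size([1, 1000, 6, 6]) -- a 6x6 heatmap per class

# Sanity check: on the original input size both heads agree.
assert torch.allclose(fc_head(feat_224), conv_head(feat_224).flatten(1), atol=1e-4)
```

The 6×6 output for the 384×384 input is exactly the grid of class-score vectors the second quoted comment describes.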

As shown in the following figure (from the FCN segmentation paper), the upper network gives one score per class, while after the conversion ("convolutionalization") we get a heatmap per class for a larger image.
[Figure from the FCN paper: a classification net producing a single score per class vs. its convolutionalized version producing a heatmap per class]

As for "stride" in the usual sense (how far a filter moves at each step), the same page explains in the section Spatial arrangement:

When the stride is 1 then we move the filters one pixel at a time. When the stride is 2 (or uncommonly 3 or more, though this is rare in practice) then the filters jump 2 pixels at a time as we slide them around. This will produce smaller output volumes spatially.
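The "32 pixels" in the second quoted comment is a different kind of stride, and it is not a filter size: it is the network's total downsampling factor (the stride-2 poolings and strided convs in the net multiply out to a reduction by 32). Shifting the 224×224 crop by 32 pixels in the input therefore shifts the converted network's output by exactly one position, which is why evaluating the crops one by one agrees with the single converted forward pass. The arithmetic is the usual output-size formula (W − F)/S + 1 applied at the scale of the whole network; here is a tiny sketch with the sizes from the quote:

```python
# Stepping a 224x224 crop across a 384x384 image in strides of 32 pixels:
image_size, crop_size, net_stride = 384, 224, 32

positions = list(range(0, image_size - crop_size + 1, net_stride))
print(positions)       # [0, 32, 64, 96, 128, 160]
print(len(positions))  # 6 -> a 6x6 grid of crops per image,
                       # matching the 6x6 map produced by the converted net
```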