Solved – What does the term “receptive field size” in the MatConvNet library mean?

conv-neural-network, machine learning, neural networks

I was reading MatConvNet's tutorial for (convolutional) deep learning and it said:

"…the receptive field size for the layer. This is the size (in
pixels) of the local image region that affects a particular element in
a feature map."

which makes sense given the traditional definition of a receptive field. It's usually thought of as the number of pixels that affect a particular node in the feature map. However, when I did the exercise, they had the following table:

     layer|    0|    1|    2|    3|    4|         5|
      type|input| conv| relu| conv| relu|      conv|
      name|  n/a|conv1|relu1|conv2|relu2|prediction|
----------|-----|-----|-----|-----|-----|----------|
   support|  n/a|    3|    1|    3|    1|         3|
  filt dim|  n/a|    1|  n/a|   32|  n/a|        32|
 num filts|  n/a|   32|  n/a|   32|  n/a|         1|
    stride|  n/a|    1|    1|    1|    1|         1|
       pad|  n/a|    1|    0|    1|    0|         1|
----------|-----|-----|-----|-----|-----|----------|
   rf size|  n/a|    3|    3|    5|    5|         7|

where we can see that the second convolution layer (layer 3) has an rf size (receptive field size) of 5. I was wondering, how did they get that number for the receptive field? I thought that the receptive field just referred to the size of the input region used to compute one element of a feature map, i.e. the same as the filter size of that convolution layer (though I am aware the concept can extend to lower layers, as explained in chapter 9 of the Bengio, Goodfellow, Courville (BGC) deep learning book). Regardless, even aware of the extended definition, I am still unsure how the number 5 was obtained for layer 3. Any ideas?

Best Answer

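The receptive field of a feature is the set of pixels in the input image that contribute to that feature, whichever layer of the network the feature is in. It is always measured in the input image, not in the previous layer's feature map.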

Layer 1: Each point in the feature map comes from a 3x3 patch of pixels in the input image, so the RF is 3.

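Layer 3: Each point in the feature map comes from a 3x3 patch of the layer 1 feature map (the ReLU in layer 2 is elementwise, support 1, so it does not change the RF). Each point in that patch in turn maps to a 3x3 patch of the input image, and because the stride is 1, neighbouring points map to patches shifted by one pixel. If you map it all back, the overlapping patches together cover a 5x5 patch of the input image (3 + 2 = 5 along each axis), so the RF is 5.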

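Similarly, for layer 5 each point comes from a 3x3 patch of the layer 3 feature map, which widens the RF by another 2 pixels, so you get a 7x7 patch.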
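The same bookkeeping can be written as a small recurrence: each layer adds (support - 1) times the spacing, in input pixels, between neighbouring elements of its input feature map, and that spacing is multiplied by the layer's stride. Here is a minimal Python sketch of that rule, not MatConvNet's own code. The function name `receptive_field_sizes` and the `(support, stride)` tuple layout are made up for illustration. Padding is left out because it changes the feature map size, not the RF size.

```python
def receptive_field_sizes(layers):
    """Return the receptive field size (in input pixels) after each layer.

    `layers` is a list of (support, stride) pairs, one per layer, in order.
    (Hypothetical helper for illustration; not part of MatConvNet.)
    """
    rf = 1    # an input pixel "sees" only itself
    jump = 1  # spacing, in input pixels, between adjacent feature map elements
    sizes = []
    for support, stride in layers:
        rf += (support - 1) * jump  # widen by the extra taps of this layer
        jump *= stride              # strides compound across layers
        sizes.append(rf)
    return sizes


# (support, stride) for layers 1..5, read off the table in the question
layers = [(3, 1), (1, 1), (3, 1), (1, 1), (3, 1)]
print(receptive_field_sizes(layers))  # [3, 3, 5, 5, 7]
```

With every stride equal to 1, as in this network, each 3x3 convolution simply adds 2 to the RF and each ReLU adds 0, which reproduces the "rf size" row of the table.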