I was reading MatConvNet's tutorial for (convolutional) deep learning and it said:
"…the receptive field size for the layer. This is the size (in
pixels) of the local image region that affects a particular element in
a feature map."
which matches the traditional definition of a receptive field: it's usually thought of as the number of pixels that affect a particular node in the feature map. However, when I went to do the exercise, they have the following table:
     layer|    0|    1|    2|    3|    4|         5|
      type|input| conv| relu| conv| relu|      conv|
      name|  n/a|conv1|relu1|conv2|relu2|prediction|
----------|-----|-----|-----|-----|-----|----------|
   support|  n/a|    3|    1|    3|    1|         3|
  filt dim|  n/a|    1|  n/a|   32|  n/a|        32|
 num filts|  n/a|   32|  n/a|   32|  n/a|         1|
    stride|  n/a|    1|    1|    1|    1|         1|
       pad|  n/a|    1|    0|    1|    0|         1|
----------|-----|-----|-----|-----|-----|----------|
   rf size|  n/a|    3|    3|    5|    5|         7|
where we can see that the second convolution layer (layer 3) has an rf size (receptive field size) of 5. How did they get that number? I thought the receptive field just referred to the size of the input patch used to compute one element of a feature map, i.e. the same as the filter size of that convolution layer (though I am aware the concept can be extended back through lower layers, as explained in chapter 9 of the Bengio, Goodfellow, and Courville deep learning book). Regardless, even with the extended definition in mind, I am still unsure how the number 5 was obtained for layer 3. Any ideas?
Best Answer
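The receptive field of a feature is the region of the input image whose pixels contribute to that feature, for any layer of the network.
Layer 1: each point in the feature map is computed from a 3x3 patch of the input image, so the RF size is 3.
Layer 3: each point in the feature map is computed from a 3x3 patch of the layer 1 feature map (the relu in layer 2 has support 1, so it changes nothing). Each of those layer 1 points in turn sees a 3x3 patch of the input image, and because the stride is 1, neighbouring patches are shifted by one pixel and overlap. Mapped back to the input, their union is a 5x5 patch, so the RF size is 5.
Similarly, layer 5 adds one more pixel on each side, giving a 7x7 patch. In general, with stride 1 everywhere, each layer with support k increases the RF size by k - 1.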
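To make the mapping-back argument concrete, here is a minimal sketch that reproduces the "rf size" row of the table from the "support" and "stride" rows. The function name `receptive_fields` and the `(name, support, stride)` tuple layout are my own illustration, not part of MatConvNet. It assumes the usual recurrence: each layer grows the receptive field by (support - 1) times the product of the strides of all earlier layers. Padding is ignored because it only matters at the image border and does not change the RF size.

```python
def receptive_fields(layers):
    """Return [(name, rf_size), ...] for layers given in forward order.

    layers: list of (name, support, stride) tuples (hypothetical layout,
    mirroring the 'support' and 'stride' rows of the table).
    """
    rf = 1    # a single input pixel "sees" only itself
    jump = 1  # spacing, in input pixels, between adjacent outputs of the previous layer
    sizes = []
    for name, support, stride in layers:
        # Each extra tap of the filter reaches `jump` input pixels further.
        rf += (support - 1) * jump
        # Striding spreads the outputs of this layer further apart in the input.
        jump *= stride
        sizes.append((name, rf))
    return sizes


# The network from the table: three 3x3 convs with stride 1, relus in between.
layers = [
    ("conv1", 3, 1),
    ("relu1", 1, 1),
    ("conv2", 3, 1),
    ("relu2", 1, 1),
    ("prediction", 3, 1),
]

for name, rf in receptive_fields(layers):
    print(f"{name}: {rf}")
```

For the table's network this prints 3, 3, 5, 5, 7, matching the "rf size" row. With strides larger than 1 the growth per layer is bigger, which is why the sketch tracks `jump` instead of just adding 2 each time.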