How do anchors play a part in the Region Proposal Network (RPN) in Faster R-CNN?

computer-vision, machine-learning

My reading is based on this article: https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection
as well as this YouTube video: https://www.youtube.com/watch?v=X3IlbjQs190

I'm confused as to how anchors have anything to do with how the RPN works. The article mentions that the RPN takes the feature map obtained from the base network as input and applies a 3×3 convolution over it, obtaining a layer with 512 channels. I'm guessing the height and width of this layer should be around 14×14, since a 3×3 convolution shouldn't change the spatial dimensions by much. This layer is then connected to two independent 1×1 convolutional layers over the 512 channels: one of depth 4k (where k is the number of anchors), corresponding to the center location and dimensions of each anchor, and one of depth 2k, corresponding to the probability that each anchor is background or foreground.

I don't understand how applying a 1×1 convolution over a multidimensional array gives a scalar (since the loss is computed from the last 1×1 convolutional layer), and nowhere does the article mention any fully connected layers. I also don't understand where anchors come into play, as the RPN simply does operations on the input feature map.

Best Answer

I don't understand how applying a 1x1 convolution over a multidimensional array gives a scalar (since the loss is computed from the last 1x1 convolutional layer), and nowhere does the article mention any fully connected layers.

The 1x1 convolution acts as a linear layer on a feature vector: for an $H\times W\times C_{in}$ feature map, where $H$ is the height, $W$ is the width and $C_{in}$ is the number of channels, at each spatial location we have a $1\times 1\times C_{in}$ feature vector. That is, the 1x1 convolution acts per spatial location across all channels, linearly mapping a feature vector of size $1\times 1\times C_{in}$ to $1\times 1\times C_{out}$.

If you are familiar with PyTorch, then the following two are equivalent (with the only difference in the input and output shapes):

```python
nn.Linear(in_features=256, out_features=2*k)
nn.Conv1d(in_channels=256, out_channels=2*k, kernel_size=1)
```
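
To make the equivalence concrete, here is a small sketch (assuming `k = 9` anchors per location, the value used in the paper) that copies the weights of the linear layer into the 1x1 convolution and checks that both produce the same output:

```python
import torch
import torch.nn as nn

k = 9  # assumed number of anchors per location (3 scales x 3 aspect ratios)

linear = nn.Linear(in_features=256, out_features=2 * k)
conv = nn.Conv1d(in_channels=256, out_channels=2 * k, kernel_size=1)

# A 1x1 conv kernel is just the linear weight matrix with a trailing
# length-1 dimension, so the weights can be shared directly.
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, 100, 256)        # (batch, locations, channels)
out_linear = linear(x)              # -> (1, 100, 2k)
out_conv = conv(x.transpose(1, 2))  # channels-first input -> (1, 2k, 100)

print(torch.allclose(out_linear, out_conv.transpose(1, 2), atol=1e-6))  # True
```

In the RPN itself the same idea is applied over a 2-D feature map, i.e. with nn.Conv2d and kernel_size=1.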

Below is a sketch that shows how a 3x3 sliding window (red) of the RPN is applied at some location (blue dot) of the feature map with 512 channels.

[Figure: the 3x3 RPN sliding window (red) centered on a feature-map location (blue dot) of the 512-channel feature map]
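
Putting the pieces together, here is a minimal PyTorch sketch of the RPN head (my own illustration under the assumptions above: 512 input channels, k anchors per location; not the reference implementation):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3x3 conv followed by two parallel 1x1 convs."""

    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # 3x3 sliding window; padding=1 keeps the spatial size unchanged
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Objectness scores: 2 values (background/foreground) per anchor
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)
        # Box regression: 4 values (t_x, t_y, t_w, t_h) per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# A 14x14 feature map with 512 channels and k=9 anchors per location:
head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 14, 14))
print(scores.shape, deltas.shape)  # (1, 18, 14, 14) and (1, 36, 14, 14)
```

Note that both outputs keep the 14x14 spatial layout: every spatial position carries the scores and box deltas for its own set of k anchors, so no fully connected layer is needed.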

Also, I don't understand where anchors come into play, as the RPN simply does operations on the input feature map.

Each feature-map location (blue dot) has an associated set of $k$ anchor boxes of fixed scales and aspect ratios. For every anchor box, the regression head of the RPN outputs four values $t_x, t_y, t_w, t_h$, which are used to move the center and resize that anchor box, producing a region proposal; the classification branch of the RPN (a softmax classifier) supplies the corresponding objectness score.
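
Concretely, with the box parameterization from the Faster R-CNN paper, an anchor with center $(x_a, y_a)$, width $w_a$ and height $h_a$ is decoded into a proposal with

$$x = x_a + t_x w_a, \qquad y = y_a + t_y h_a, \qquad w = w_a e^{t_w}, \qquad h = h_a e^{t_h}.$$

So the RPN never predicts absolute box coordinates: it only predicts how far to shift and rescale each fixed anchor, which is exactly where the anchors enter the picture.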