How do anchors play a part in the Region Proposal Network (RPN) in Faster R-CNN?

computer-vision, machine-learning

My reading is based on this article: https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection
as well as this YouTube video: https://www.youtube.com/watch?v=X3IlbjQs190

I'm confused as to how anchors have anything to do with how the RPN works. The article mentions that the RPN takes the feature map obtained from the base network as input and applies a 3×3 convolution over it, obtaining a layer with 512 channels. I'm guessing the height and width of this layer should be around 14×14, since a 3×3 convolution shouldn't change the spatial dimensions by much. This layer is then connected to two independent 1×1 convolutional layers over the 512 channels: one of depth 4k (where k is the number of anchors), corresponding to the center location and dimensions of each anchor, and one of depth 2k, corresponding to the probability that each anchor is background or foreground.

I don't understand how applying a 1×1 convolution over a multidimensional array gives a scalar (since the loss is computed from the last 1×1 convolutional layer), and nowhere does the article mention any fully connected layers. I also don't understand where anchors come into play, as the RPN simply does operations on the input feature map.

Best Answer

I don't understand how applying a 1x1 convolution over a multidimensional array gives a scalar (since the loss is computed from the last 1x1 convolutional layer), and nowhere does the article mention any fully connected layers.

The 1x1 convolution acts as a linear layer on a feature vector: for an $H\times W\times C_{in}$ feature map, where $H$ is the height, $W$ is the width and $C_{in}$ is the number of channels, at each spatial location we have a $1\times 1\times C_{in}$ feature vector. That is, the 1x1 convolution acts per spatial location across all channels, linearly mapping a feature vector of size $1\times 1\times C_{in}$ to $1\times 1\times C_{out}$.

If you are familiar with PyTorch, then the following two are equivalent (with the only difference in the input and output shapes):

```python
nn.Linear(in_features=256, out_features=2*k)
nn.Conv1d(in_channels=256, out_channels=2*k, kernel_size=1)
```
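
To make the equivalence concrete, here is a small sketch (assuming `k = 9` anchors per location, the value used in the paper) that copies the weights of the linear layer into the 1x1 convolution and checks that both produce the same output:

```python
import torch
import torch.nn as nn

k = 9  # assumed number of anchors per location (3 scales x 3 aspect ratios)

linear = nn.Linear(in_features=256, out_features=2 * k)
conv = nn.Conv1d(in_channels=256, out_channels=2 * k, kernel_size=1)

# A 1x1 conv kernel is just the linear weight matrix with a trailing
# length-1 dimension, so the weights can be shared directly.
with torch.no_grad():
    conv.weight.copy_(linear.weight.unsqueeze(-1))
    conv.bias.copy_(linear.bias)

x = torch.randn(1, 100, 256)        # (batch, locations, channels)
out_linear = linear(x)              # -> (1, 100, 2k)
out_conv = conv(x.transpose(1, 2))  # channels-first input -> (1, 2k, 100)

print(torch.allclose(out_linear, out_conv.transpose(1, 2), atol=1e-6))  # True
```

In the RPN itself the same idea is applied over a 2-D feature map, i.e. with nn.Conv2d and kernel_size=1.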

Below is a sketch that shows how a 3x3 sliding window (red) of the RPN is applied at some location (blue dot) of the feature map with 512 channels.

[Figure: the 3x3 RPN sliding window (red) centered on a feature-map location (blue dot) of the 512-channel feature map]
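
Putting the pieces together, here is a minimal PyTorch sketch of the RPN head (my own illustration under the assumptions above: 512 input channels, k anchors per location; not the reference implementation):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3x3 conv followed by two parallel 1x1 convs."""

    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # 3x3 sliding window; padding=1 keeps the spatial size unchanged
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Objectness scores: 2 values (background/foreground) per anchor
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)
        # Box regression: 4 values (t_x, t_y, t_w, t_h) per anchor
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# A 14x14 feature map with 512 channels and k=9 anchors per location:
head = RPNHead()
scores, deltas = head(torch.randn(1, 512, 14, 14))
print(scores.shape, deltas.shape)  # (1, 18, 14, 14) and (1, 36, 14, 14)
```

Note that both outputs keep the 14x14 spatial layout: every spatial position carries the scores and box deltas for its own set of k anchors, so no fully connected layer is needed.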

Also, I don't understand where anchors come into play, as the RPN simply does operations on the input feature map.

Each feature-map location (blue dot) has an associated set of $k$ anchor boxes of fixed scales and aspect ratios. For every anchor box, the regression head of the RPN outputs four values $t_x, t_y, t_w, t_h$, which are used to move the center and resize that anchor box, producing a region proposal; the classification branch of the RPN (a softmax classifier) supplies the corresponding objectness score.
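
Concretely, with the box parameterization from the Faster R-CNN paper, an anchor with center $(x_a, y_a)$, width $w_a$ and height $h_a$ is decoded into a proposal with

$$x = x_a + t_x w_a, \qquad y = y_a + t_y h_a, \qquad w = w_a e^{t_w}, \qquad h = h_a e^{t_h}.$$

So the RPN never predicts absolute box coordinates: it only predicts how far to shift and rescale each fixed anchor, which is exactly where the anchors enter the picture.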