Solved – How to train the RPN in Faster R CNN

conv-neural-networkdeep learningobject detection

Link to paper

I'm trying to understand the region proposal network in faster rcnn. I understand what it's doing, but I still don't understand how training exactly works, especially the details.

Let's assume we're using VGG16's last layer with shape 14x14x512 (before maxpool and with 228×228 images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding box coordinates. My intermediate layer is a 512 dimensional vector.
(image shows 256 from ZF network)
from the paper

In the paper they write

"we randomly sample 256 anchors in an image to compute the loss
function of a mini-batch, where the sampled positive and negative
anchors have a ratio of up to 1:1"

That's the part I'm not sure about. Does this mean that for each one of the 9(k) anchor types the particular classifier and regressor are trained with minibatches that only contain positive and negative anchors of that type?

Such that I basically train k different networks with shared weights in the intermediate layer? Therefore each minibatch would consist of the training data x=the 3x3x512 sliding window of the conv feature map and y=the ground truth for that specific anchor type.
And at inference time I put them all together.

I appreciate your help.

Best Answer

Does this mean that for each one of the 9(k) anchor types the particular classifier and regressor are trained with minibatches that only contain positive and negative anchors of that type?

No. RPN is looking whether any bounding box candidate (anchor) contains object of any known type or not. They call it objectness score in the article. The actual class of an object is considered in the loss function where a prediction of a class contributes to the loss (see $L_{cls}$ in Eq 1 on page 3 in the article).

Regarding your question, if you have 9 classes then the network will not be trained 9 times, but the cls layer which has 2*k dimension on your image will be 9*k (2 represents binary classification, 9 the number of your classes).

What is the role of anchors?

Your confusion was perhaps caused by the term anchor as you mention specific anchor type. Anchor types are not directly coupled with object types. As the article says, an anchor is a rectangle and anchors differ by their scale and aspect ratio so that for each region, different types of boxes (anchors) are tested (the blue ones on your image). This is because a car might require a different bounding box than a person. Of course, in practice, you will select these k box shapes (anchors) so that they would cover all your desired object shapes.

Whether an anchor is positive or negative, is determined by the loss function. Positive anchor is the one that 1) has highest overlap (IoU) with any ground truth bounding box or 2) its IoU overlap is above 0.7. Negative anchor is a box whose IoU is below threshold 0.3 for all ground truth boxes. The rest of the anchors are ignored in the loss function. As you see, this also allows the network to consider all of yours classes in one go (as bounding boxes from different object types are considered).

Multi-class classification is indeed difficult to understand from the paper as the authors are using only binary classifier in their own example.

Related Question