What do anchors' scales actually refer to in Faster RCNN?

deep learning, object detection

I am trying to understand Faster RCNN, but I can't work out what the anchors' scales mean, especially in this article: Faster RCNN.

The authors consider 3 scales, $(128^2, 256^2, 512^2)$.

What does this line mean?

I know that for each $3 \times 3$ spatial location in the feature map (VGG) we perform a convolution, and after that we apply a conv for each anchor box.
The receptive field of each of those $3 \times 3$ spatial locations is $(16 \times 3)^2$ in the original image, so I think the anchors' area should be smaller than $(16 \times 3)^2$. Isn't $512^2$ too big for an anchor? What happens when it's near the edge of the original image?

Best Answer

This is addressed by the authors of the paper in section 3.3:

We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.
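To make the scales concrete: $128^2$, $256^2$ and $512^2$ are box areas in pixels of the input image, not sizes on the feature map. Below is a minimal sketch of the nine anchors at one location, assuming the paper's three aspect ratios (1:1, 1:2, 2:1) and reading a ratio as height divided by width. The function names are illustrative and not taken from any library. On the edge question, Section 3.3 of the paper says that anchors crossing the image boundary are ignored during training and that proposals are clipped to the image at test time; the two helpers mirror those rules.

```python
import math

SCALES = (128, 256, 512)   # square roots of the anchor areas, in input-image pixels
RATIOS = (0.5, 1.0, 2.0)   # height / width (assumed convention for 1:2, 1:1, 2:1)

def anchors_at(cx, cy):
    """Return the 9 anchors (x1, y1, x2, y2) centred on image point (cx, cy)."""
    boxes = []
    for s in SCALES:
        for r in RATIOS:
            w = s / math.sqrt(r)   # chosen so that w * h == s**2 and h / w == r
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def inside_image(box, img_w, img_h):
    """Training rule: keep only anchors that lie fully inside the image."""
    x1, y1, x2, y2 = box
    return x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h

def clip_to_image(box, img_w, img_h):
    """Test-time rule: clip a box to the image boundary."""
    x1, y1, x2, y2 = box
    return (max(x1, 0), max(y1, 0), min(x2, img_w), min(y2, img_h))
```

For a 1000 by 600 image, all nine anchors centred at the corner location (8, 8) cross the boundary, so that location would contribute nothing to the training loss under the rule above.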

Moreover, the authors compute the effective receptive field of the RPN on top of the VGG feature map to be 228 by 228 pixels, which is far larger than the 48 by 48 you suggested and much closer to 512 by 512.
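The 228 figure can be reproduced with the usual receptive-field recurrence. This is a rough sketch, assuming the standard VGG-16 layout ($3 \times 3$ convolutions with stride 1, $2 \times 2$ max pooling with stride 2, features taken from conv5_3) followed by the RPN's $3 \times 3$ convolution. The names `receptive_field`, `VGG16_CONV5_3` and `RPN` are made up for this example. The 48 in the question comes from multiplying the kernel size by the total stride of 16, which ignores how the receptive field grows through every earlier layer.

```python
# Each layer is (kernel_size, stride).
CONV, POOL = (3, 1), (2, 2)

# VGG-16 up to conv5_3: conv blocks of 2, 2, 3, 3, 3 layers with 4 poolings between them.
VGG16_CONV5_3 = (
    [CONV] * 2 + [POOL]
    + [CONV] * 2 + [POOL]
    + [CONV] * 3 + [POOL]
    + [CONV] * 3 + [POOL]
    + [CONV] * 3
)
RPN = [CONV]   # the 3x3 sliding window applied on the conv5_3 feature map

def receptive_field(layers):
    """Return (receptive field, total stride), both in input pixels."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # spacing between adjacent outputs, in input pixels
    return rf, jump

print(receptive_field(VGG16_CONV5_3))        # (196, 16)
print(receptive_field(VGG16_CONV5_3 + RPN))  # (228, 16)
```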