Solved – How Single Shot Detectors (SSD) object detection calculates its class scores and bbx locations

computer vision, deep learning, object detection

As I understand from the paper, SSD tries to predict object locations and their relevant class scores from different feature maps.
[Figure: SSD architecture]

So for each layer there can be different predictions with respect to the number of anchor (reference) boxes at different scales.

So if one convolutional feature map has 5 reference boxes, there should be class scores and bbx coordinates for each of those reference boxes.

We make the above predictions by sliding a window (kernel, e.g. 3×3) over the feature maps of different layers. What I am not clear about is the connection from the sliding window at a given position to the score layer.

1. Is it just a connection of the convolution window output to the score layer in a fully connected way?
2. Or do we do some other operation on the convolution window output before connecting it to the score layer?

Best Answer

The class score and bbx predictions are obtained by convolution. This is a difference between YOLO and SSD: SSD doesn't use a fully connected layer for this. I will explain how the scores are computed.

Above is an 8×8 spatial feature map in an SSD feature extractor model. For each position in the feature map we are going to predict the following:

  • 4 bbx coordinates w.r.t. each default box (shown in dotted lines)
  • class scores for each default box (c classes)

Let's say we have k default (anchor) boxes; then we predict k·(4+c) values per location.
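For example (numbers assumed just for illustration, in the PASCAL VOC setting used in the paper): with k = 6 default boxes and c = 21 classes (20 object classes plus background), that is 6 × (4 + 21) = 150 values per feature-map location.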

Now the tricky part: how do we get those scores?

  • Here we use a set of convolutional kernels whose depth equals the depth of the feature map (spatially they are normally 3×3).
  • Since there are (4+c) predictions w.r.t. a single anchor box, it is as if we have (4+c) of the above-mentioned kernels, each with the depth of the feature map. So it is more like a set of filters.

This set of filters predicts the above (4+c) scalars.
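Concretely (a depth of 512 is just an assumed example), each such filter is a 3×3×512 kernel that produces exactly one scalar at every spatial location, so (4+c) of them give the (4+c) numbers for one default box at that location.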

So for a single feature map, if there are k anchor boxes that we reference in prediction,

we have **k·(4+c) filters (3×3 in spatial size) applied around each location of the feature map in a sliding-window manner.**
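Here is a minimal PyTorch sketch of what that looks like for the 8×8 feature map above. The feature-map depth, number of default boxes, and class count are assumed for illustration only; this is not the authors' original implementation, just the same idea expressed as code:

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed values, not the original SSD code): one prediction head
# for the 8x8 feature map above, built exactly as described -- k*(4+c) filters of
# spatial size 3x3 slid over the feature map by convolution.

k = 4        # default (anchor) boxes per location (assumed)
c = 21       # classes including background (assumed, VOC-style)
in_ch = 512  # depth of this feature map (assumed)

loc_head = nn.Conv2d(in_ch, k * 4, kernel_size=3, padding=1)  # 4 bbx offsets per box
cls_head = nn.Conv2d(in_ch, k * c, kernel_size=3, padding=1)  # c class scores per box

feature_map = torch.randn(1, in_ch, 8, 8)  # batch of 1, 8x8 spatial size

loc = loc_head(feature_map)  # (1, k*4, 8, 8)
cls = cls_head(feature_map)  # (1, k*c, 8, 8)

# Reshape so every row is one default box at one feature-map location
loc = loc.permute(0, 2, 3, 1).reshape(1, -1, 4)  # (1, 8*8*k, 4)
cls = cls.permute(0, 2, 3, 1).reshape(1, -1, c)  # (1, 8*8*k, c)
print(loc.shape, cls.shape)  # torch.Size([1, 256, 4]) torch.Size([1, 256, 21])
```

The same pattern is repeated on every feature map that SSD predicts from, each with its own depth and number of default boxes.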

We train those filter values!